1 Introduction

A Knowledge Graph (KG) is a large-scale structured semantic network whose nodes represent entities and whose edges represent relations between entities [9]. The rise of KGs in recent years has contributed greatly to the success of many applications such as entity linking [10], recommendation [36] and question answering [25].

As more and more entities become involved in a KG, a large portion of the relations between entities may be missing. To deal with this, the task of Knowledge Graph Completion (KGC) has been proposed, aiming at completing missing relations between entities in KGs. In the past years, a lot of attention has been paid to this topic, and existing work can be roughly divided into closed-world KGC approaches and open-world KGC approaches.

The closed-world KGC approaches tend to utilize the knowledge within KGs. The most active closed-world KGC methods are based on knowledge graph embedding models such as TransE [4] and its variants [14, 21, 38]. By encoding the entities and the relations between entities in KGs into a continuous low-dimensional embedding space, we can perform inference to identify hidden relations between entities. However, closed-world KGC approaches can only complete a very limited number of relations, i.e., they usually lead to low recall. On the other hand, some work tends to obtain knowledge from open-world resources such as online encyclopedias and newswire corpora. For instance, Description-Embodied Knowledge Representation Learning (DKRL) [39] proposes to learn the representations of entities not only from TransE, but also from the descriptions of the entities in online encyclopedias. To achieve this, DKRL jointly trains graph-based embeddings and description-based embeddings, using continuous bag-of-words and deep convolutional neural network models to encode the semantics of entity descriptions. However, it does not take into account that different relationships focus on different parts of the entity description, and that not all information provided in an entity description is useful for predicting the linked entities of a given relationship.

A recently proposed open-world KGC model called ConMask [30] learns embeddings of an entity’s name and parts of its text description to connect unseen entities to the KG. As illustrated in Figure 1, it first uses a so-called relationship-dependent content masking approach to select the words related to the given relationship in the relevant text description, which effectively mitigates the influence of noisy text descriptions. Next, it trains a Fully Convolutional Neural network (FCN) to extract word-based target entity embeddings from the relevant text descriptions. Finally, the extracted word-based target entity embeddings and other textual features (entity names) are compared with the existing target candidate entities in the KG to produce a ranked list of target candidate entities.

Fig. 1 Framework of the ConMask model, where kd is the length of the entity description, kn is the length of the relationship name and kemb is the word embedding size

However, there are at least two weaknesses in the ConMask model. First, the information in the entity descriptions is not fully used: only pre-trained word embeddings are used to represent the words in the text descriptions, so some potential semantic and statistical information might be missing. Second, the proposed relationship-dependent content masking model can only find possible target words that appear in a fixed-size content masking window after the indicator word, ignoring the situation where the target words appear in front of the indicator word. Besides, it is difficult to determine a proper size for the content masking window.

In our previous work [7], we proposed a novel open-world KGC approach based on the same input resources as ConMask, i.e., entity names, relationship names, and entity descriptions. Different from ConMask, which uses the entity descriptions in a very simple way, we proposed to use attention mechanisms to fully capture the important information generated from the multiple interactions between entity names, relationship names and entity descriptions. More specifically, the multiple interactions involved in the model include: (1) the interaction between the head entity name, the relationship name, and the head entity description; (2) the interaction between the head entity, the relationship, and the candidate tail entity; and (3) the interaction between the description of the head entity and the candidate tail entity descriptions. In this way, our Multiple Interaction Attention (MIA) model can not only flexibly select the relevant parts of an entity description according to different relationships, but also become aware of the relevant parts of the head entity description and obtain a head-aware representation of each candidate tail entity description.

In this paper, we further introduce the interaction between multiple candidate tail entity descriptions, so that our model can explore the hidden relationships among the tail entity descriptions and use these relationships to enhance their representations. In addition, in the final layer of the model, we design several scoring functions to compare the convergence behaviour of the model under different functions and to enhance its effectiveness.

Besides, to make effective use of the rich information in the entity descriptions, our model encodes the head entity description, the head entity name, the relationship name, and the candidate tail entity descriptions into word representations that are enhanced by additional Part-Of-Speech (POS) tags, Named-Entity-Recognition (NER) tags and handcrafted textual features.

Our contributions in this paper can be summarized as follows:

  • We propose to use an attention mechanism to simulate the interaction between the head entity name, the relationship name and the entity descriptions, such that we can dynamically select the most relevant information from the head entity description and the candidate tail entity descriptions according to different relations.

  • We use the attention mechanism to align relevant parts between the head entity description and the candidate tail entity descriptions, such that we could enrich the representation of the candidate tail entity description.

  • We use the attention mechanism to capture hidden relationships between multiple candidate tail entities and use them to enhance their representations.

  • We also propose to make effective use of the rich information in the entity descriptions with some additional important features.

Our empirical study conducted on three real-world data collections shows that our approach achieves significant improvements on open-world KGC compared with state-of-the-art methods.

The rest of this paper is organized as follows: We cover the related work in Section 2, and then define our problem and introduce the framework of our approach in Section 3. After giving details of MIA model in Section 4, we report the empirical study in Section 5. We finally conclude the paper in Section 6.

2 Related work

Knowledge graph completion (KGC) aims at completing the missing relations between entities in a given KG (or KGs). So far, a lot of attention has been paid to this topic, and existing approaches can be roughly divided into closed-world KGC approaches and open-world KGC approaches.

2.1 The closed-world knowledge graph completion

The closed-world KGC approaches tend to utilize the knowledge within KGs. The most active closed-world KGC methods are based on knowledge graph embedding models such as TransE [4] and its variants [1, 14, 16, 21, 23, 34, 38]. By encoding the entities and the relations between entities in the KG into a continuous low-dimensional embedding space, we can perform inference to identify hidden relations between entities.

For a given triple (head entity, relationship, tail entity), also denoted as (h,r,t), the typical embedding-based KGC model TransE [4] defines the energy function as

$$ E(h,r,t) = \parallel \mathbf{h} + \mathbf{r} - \mathbf{t}\parallel, $$
(1)

which indicates that the tail embedding t should be the closest neighbour of h + r, where h, r are the embeddings of the head entity and the relationship respectively. There are also many models that introduce more relationship-dependent parameters. For example, TransR [21] requires hMr + r = tMr, where Mr is a relationship-dependent entity embedding transformation; it models entities and relations in distinct semantic spaces (entity space and relation spaces) and performs translation in the relation space when learning embeddings. PTransE [20] maintains a simple translation function and proposes a multiple-step relation path-based representation learning model. SimplE [16] and TuckER [1] employ tensor decomposition to train and obtain representations of head entities, tail entities and relations. RotatE [34] is designed to model three special relation patterns: symmetry/antisymmetry, inversion, and composition. Liu et al. [22, 24] focus on the optimal social trust path selection problem in complex social networks.
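To make the translation assumption concrete, the following is a minimal PyTorch sketch of the TransE energy in (1); the tensor shapes and the toy batch are purely illustrative, not the settings of any cited work.

```python
import torch

def transe_energy(h: torch.Tensor, r: torch.Tensor, t: torch.Tensor, p: int = 1) -> torch.Tensor:
    """E(h, r, t) = || h + r - t ||_p, as in equation (1); a lower energy means a more plausible triple."""
    return torch.norm(h + r - t, p=p, dim=-1)

# Toy usage with random 50-dimensional embeddings for a batch of 3 triples.
h = torch.randn(3, 50)
r = torch.randn(3, 50)
t = torch.randn(3, 50)
print(transe_energy(h, r, t))  # one energy value per triple
```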

Unlike topology-based models, which have been extensively studied, several methods use textual information for KGC. For instance, the Neural Tensor Network (NTN) model [31] initializes the representation of an entity with the average of the word embeddings in its name, which allows sharing of textual information across similar entity names. Zhang et al. [42] represent entities with entity names or the average of the word embeddings in their descriptions. Jointly [40] first uses a weighted sum to combine topology-based embeddings and text-based embeddings, and then calculates the Ln distance between the translated head and tail entities. However, closed-world KGC approaches can only complete a very limited number of relations, i.e., they usually lead to low recall.

2.2 The open-world knowledge graph completion

More recent work tends to obtain knowledge from open-world resources such as online encyclopedias and newswire corpora. In traditional research, such external knowledge is often used to discover new relationships in KGs, a task often called relation extraction [26]. Common approaches use neural networks such as CNNs [40] or abstract meaning representations (AMRs) [13]. When it comes to KGC, DKRL [39] jointly trains graph-based embeddings and description-based embeddings, using continuous bag-of-words and deep convolutional neural network models to encode the semantics of entity descriptions; it can directly build representations from the descriptions of novel entities. A recent work proposes the ConMask [30] model, a text-focused approach that reduces irrelevant and noisy words by selecting the words associated with the relationship in the given entity description, fuses the relevant text through fully convolutional neural networks (FCN) to extract word-based entity embeddings, and combines them with background representations of other textual features (entity names) to connect unseen entities to the KG.

2.3 Text-focused studies in machine reading comprehension

ConMask is designed to extract information from the description text to help with KGC. The problem of extracting the required information from texts for a given question is well studied in the field of machine reading comprehension [15, 32, 35, 43, 44]. There are two main research directions in machine reading comprehension: generative reading comprehension and multiple-choice reading comprehension. The goal of the former is to extract the answer to the question from a given text and question, where the dominant trend is a variety of attention-based interactions between text and question. For example, Kadlec et al. propose a method called “Attention Sum Reader” [15], which directly extracts the answer from the text using an attention mechanism. Recently, Cui et al. propose an attention-over-attention (AoA Reader) neural network for reading comprehension, which reduces the computational complexity of the model; its performance is further improved by document-level attention. When it comes to multiple-choice reading comprehension, researchers try to introduce the information in candidate answers into the model. Trischler et al. [35] propose a parallel-hierarchical neural model that matches texts, questions and candidate answers from word level to sentence level. However, the model must be trained with a training wheel [32] to converge. In recent work, Zhu et al. [44] propose a model that uses hierarchical attention flow to enhance the interaction between the candidate answers and the text; the experimental results on their datasets show that their method significantly outperforms various state-of-the-art systems.

3 Problem definition and the framework

We formally define the Knowledge Graph Completion (KGC) task below:

Definition 1 (Knowledge Graph Completion (KGC))

Given a Knowledge Graph \(\mathcal {K}\mathcal {G}\) with a set of incomplete relation triples in the form of (h,r,?), where h denotes the head entity, r denotes the relation, and ? is the missing tail entity t, the task of Knowledge Graph Completion (KGC) is to find t for each incomplete relation triple so as to form a complete triple (h,r,t).

To illustrate how our Multiple Interaction Attention (MIA) model works on the open-world KGC task, several examples are given in Figure 2. For a given partial triple (Donald_Trump, mother, ?), suppose a human reader were asked to determine, from the head entity description and some candidate tail entity descriptions, “Who is the mother of US President Donald Trump?”. The reader would first look for contextual clues such as mother, parent or family-related information in the description of the head entity “Donald_Trump”. Here, the reader locates the sentence “His parents were \(\dots \) and Scottish-born housewife Mary Anne MacLeod” in the head entity description and may infer that Donald Trump’s mother is the Scottish-born housewife Mary Anne MacLeod. After that, the reader locates the description of the candidate tail entity “Mary_Anne_MacLeod_Trump” among the candidate tail entity descriptions and finds “Mary Anne Trump was the mother of Donald Trump, the 45th president of the United States” and “Born in the Outer Hebrides of Scotland”. Therefore, the reader can more confidently conclude that “Mary_Anne_MacLeod_Trump” is the correct tail entity of the partial triple (Donald_Trump, mother, ?).

Fig. 2 Open-world KGC examples with our MIA model

We split the above reasoning process into three steps below:

  1. Locating task-related information in the head entity description and the candidate tail entity descriptions, respectively;

  2. Extracting the context information of the related text in the head entity description and the candidate tail entity descriptions;

  3. Matching the relevant context information of the head entity description against that of the candidate tail entity descriptions to determine the correct tail entity.

Correspondingly, the MIA model is designed to simulate this process, which is mainly composed of three components below:

  1. Multiple Interaction Attention, which highlights task-related words;

  2. Text Context Encoder, which encodes context information in the relevant text;

  3. Matching Prediction, which chooses a correct tail entity by matching the context information in the relevant text to calculate the similarity score between the head entity description and the candidate tail entity descriptions.

Note that we consider that the head entity, relationship, and tail entity usually appear in a snippet of the text description at the same time, so we combine the head entity name with the relationship name into a question as an input to our model to help the model locate task-related information more effectively.

4 The MIA model

The MIA model first encodes the head entity description, the question, and the candidate tail entity descriptions into word representations and enhances them by appending some additional features. Then, it emphasizes and organizes relevant information by using a word-level attention mechanism to simulate the interaction between the head entity description, the question and the candidate tail entity descriptions. Afterwards, MIA uses a Bidirectional Long Short-Term Memory network (Bi-LSTM) to encode the contextual information of the relevant text. After that, it uses a word-level attention mechanism between the multiple candidates to enhance the representations of the tail entities. Finally, through matching prediction, it compares the representation extracted from the head entity description with the representation of each candidate tail entity description to produce a ranked list of candidate tail entities. The architecture of the MIA model is illustrated in Figure 3.

Fig. 3 Main neural architecture of the Multiple Interaction Attention (MIA) model

In the remainder of this section, we describe the details of the MIA model component by component. Throughout this section, we use Dh to denote the head entity description, Qr the question consisting of the head entity name and the relationship name, Ct the candidate tail entity descriptions, and \(C_{t_{i}}\) the description of the i-th candidate in the candidate set for the same question to be completed. Note that since the operations on each candidate tail entity description are the same, for the sake of simplicity, we only take one of the candidate tail entity descriptions for illustration.

4.1 Input word representation

We transform each word in the head entity description, the question, and the candidate tail entity description into a continuous representation. Each training example contains a head entity description \(\{w^{D_{h}}_{i}\}_{i=1}^{|D_{h}|}\), a question \(\{w^{Q_{r}}_{j}\}_{j=1}^{|Q_{r}|}\), a candidate tail entity description \(\{w^{C_{t}}_{n}\}_{n=1}^{|C_{t}|}\) and a label y∈{0,1}, where |Dh|, |Qr|, and |Ct| are the lengths of the head entity description, question, and candidate tail entity description, respectively.

Here, we take the input representation of the i-th word \(w^{D_{h}}_{i}\) in the given head entity description as an example, which is the concatenation of several components:

  • Word Embedding: We use the publicly available pre-trained GloVe [27] embedding \(\mathbf {E}_{w^{D_{h}}_{i}}^{word}\).

  • Part-Of-Speech (POS) Embedding: We use spaCy (Footnote 1) for part-of-speech tagging \(\mathbf {E}_{w^{D_{h}}_{i}}^{pos}\). Similar to traditional word embeddings, we assign different trainable vectors for each part-of-speech tag.

  • Named-Entity-Recognition (NER) Embedding: Like POS, we use spaCy for named entity recognition \(\mathbf {E}_{w^{D_{h}}_{i}}^{ner}\).

  • Handcrafted Features Embedding: We use term frequency and co-occurrence feature as handcrafted features \(\mathbf {E}_{w^{D_{h}}_{i}}^{feat}\). The term frequency is calculated based on English Wikipedia. In the binary co-occurrence feature, it is true when \(w^{D_{h}}_{i}\) appears in the question \(\{w^{Q_{r}}_{j}\}_{j=1}^{|Q_{r}|}\) or candidate tail entity description \(\{w^{C_{t}}_{n}\}_{n=1}^{|C_{t}|}\).

  • BERT Embedding: BERT [6] is an NLP model developed by Google for pre-training language representations. It leverages an enormous amount of plain text data publicly available on the web and is trained in an unsupervised manner. We use BERT-as-service (Footnote 2) to get the word representation \(\mathbf {E}_{w^{D_{h}}_{i}}^{bert}\).

We concatenate five embedding components to form the final input representations for the word \(w^{D_{h}}_{i}\), namely \(\mathbf {E}_{w^{D_{h}}_{i}}\).

$$ \mathbf{E}_{w^{D_{h}}_{i}} = \left[\mathbf{E}_{w^{D_{h}}_{i}}^{word}; \mathbf{E}_{w^{D_{h}}_{i}}^{pos};\mathbf{E}_{w^{D_{h}}_{i}}^{ner}; \mathbf{E}_{w^{D_{h}}_{i}}^{feat}; \mathbf{E}_{w^{D_{h}}_{i}}^{bert}\right] $$
(2)

In the same way, we concatenate Word Embedding \(\mathbf {E}_{w^{Q_{r}}_{j}}^{word}\), POS Embedding \(\mathbf {E}_{w^{Q_{r}}_{j}}^{pos}\) and BERT Embedding \(\mathbf {E}_{w^{Q_{r}}_{j}}^{bert}\) to get the input word representation \(\mathbf {E}_{w^{Q_{r}}_{j}}\) of a word \(w^{Q_{r}}_{j}\) in a given question.

$$ \mathbf{E}_{w^{Q_{r}}_{j}} = \left[\mathbf{E}_{w^{Q_{r}}_{j}}^{word}; \mathbf{E}_{w^{Q_{r}}_{j}}^{pos};\mathbf{E}_{w^{Q_{r}}_{j}}^{bert}\right] $$
(3)

The input representation of the \(w^{C_{t_{i}}}_{n}\) for a given candidate tail entity description \(C_{t_{i}}\) contains Word Embedding \(\mathbf {E}_{w^{C_{t_{i}}}_{n}}^{word}\) and BERT Embedding \(\mathbf {E}_{w^{C_{t_{i}}}_{n}}^{bert}\).

$$ \mathbf{E}_{w^{C_{t_{i}}}_{n}} = \left[\mathbf{E}_{w^{C_{t_{i}}}_{n}}^{word};\mathbf{E}_{w^{C_{t_{i}}}_{n}}^{bert}\right] $$
(4)
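As an illustration of how the concatenations in (2)–(4) can be assembled, the sketch below builds the input representation of a head entity description; apart from the 200-dimensional GloVe vectors mentioned in Section 5.2, all dimensions are assumptions for illustration only.

```python
import torch

# Illustrative per-word feature tensors for a head-entity description of length |D_h| = 128.
# The component dimensions below are assumptions, not the paper's exact settings.
len_dh = 128
e_word = torch.randn(len_dh, 200)   # pre-trained GloVe embedding
e_pos  = torch.randn(len_dh, 16)    # trainable POS-tag embedding
e_ner  = torch.randn(len_dh, 8)     # trainable NER-tag embedding
e_feat = torch.randn(len_dh, 2)     # term frequency + binary co-occurrence feature
e_bert = torch.randn(len_dh, 768)   # contextual BERT embedding

# Equation (2): the final input representation is the concatenation of all five components.
e_dh = torch.cat([e_word, e_pos, e_ner, e_feat, e_bert], dim=-1)
print(e_dh.shape)  # (128, 994)
```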

4.2 Multiple interaction attention

In our model, we use the interaction between the head entity, the question, and the candidate tail entity to emphasize and organize relevant information accordingly. We exploit the same word-level sequence alignment attention mechanism for each interaction. In this section, we first describe the Word-level Sequence Alignment (WSA) attention mechanism in detail and then explain various interactions.

Word-level sequence alignment attention mechanism

Following [5, 18, 37] and other recent works, given two inputs X and \(\mathbf {Y} = \{\mathbf {Y}_{i}\}_{i=1}^{m}\), we define the attention function as follows:

$$ att(\mathbf{X},\ \{\mathbf{Y}_{i}\}_{i=1}^{m}) = \sum\limits_{i=1}^{m} {a_{xy_{i}}\mathbf{Y}_{i}} $$
(5)
$$ a_{xy_{i}}= \text{softmax}(\alpha(\mathbf{W}\mathbf{X})^{T} \alpha(\mathbf{W}\mathbf{Y}_{i})) $$
(6)

where the attention score \(a_{xy_{i}}\) captures the similarity between X and each word Yi, W is a trainable matrix, and α(⋅) is an activation function with ReLU nonlinearity.
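A minimal PyTorch sketch of the WSA attention in (5)–(6) is given below; the module and variable names are ours, and the usage lines correspond to the question-aware head entity description attention introduced next (equation (7)).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WSAAttention(nn.Module):
    """Word-level sequence alignment attention, a sketch of equations (5)-(6):
    a shared projection W followed by a ReLU, a softmax over the attended sequence,
    and a weighted sum of its word vectors."""
    def __init__(self, dim: int):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # x: (len_x, dim)  the sequence to be enriched (e.g. the head-entity description)
        # y: (len_y, dim)  the sequence attended over (e.g. the question)
        px = F.relu(self.W(x))              # alpha(W X)
        py = F.relu(self.W(y))              # alpha(W Y_i)
        scores = px @ py.transpose(0, 1)    # (len_x, len_y) similarity a_{xy_i}
        weights = F.softmax(scores, dim=-1)
        return weights @ y                  # weighted sum over {Y_i}

# Usage sketch: question-aware representation of each description word (equation (7)).
att = WSAAttention(dim=200)
desc, question = torch.randn(512, 200), torch.randn(12, 200)
desc_q_aware = att(desc, question)          # (512, 200)
```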

Question-aware head entity description WSA attention

Note that the words in the head entity description are not equally important, and their importance changes with the question. Just as people pay more attention to the words in a passage that are most relevant to a given question when looking for the answer, we obtain the question-aware representation \(\mathbf {E}_{w^{D_{h}}_{i}}^{q_{r}}\) of the word \({w^{D_{h}}_{i}}\) in the head entity description according to the question:

$$ \mathbf{E}_{w^{D_{h}}_{i}}^{q_{r}} = att(\mathbf{E}_{w^{D_{h}}_{i}}^{word},\ \{\mathbf{E}_{w^{Q_{r}}_{j}}^{word}\}_{j=1}^{|Q_{r}|}) $$
(7)

Question-aware candidate tail entity description WSA attention

In a similar way, we use question information as the key to extracting important information from candidate tail entity description. For each candidate tail entity description \(C_{t_{i}}\), we get the question-aware representation \(\mathbf {E}_{w^{C_{t_{i}}}_{n}}^{q_{r}}\) of the word \(w^{C_{t_{i}}}_{n}\) in the candidate tail entity description:

$$ \mathbf{{E}}_{w^{C_{t_{i}}}_{n}}^{q_{r}} = att(\mathbf{E}_{w^{C_{t_{i}}}_{n}}^{word},\ \{\mathbf{E}_{w^{Q_{r}}_{j}}^{word}\}_{j=1}^{|Q_{r}|}) $$
(8)

Head-aware candidate tail entity description WSA attention

We find that when two entities are related, they usually mention each other in their descriptions. In order to adequately leverage the information in the head entity description, we align the candidate tail entity description with the head entity description. In detail, we embed the information of the head entity description into the candidate tail entity description representation so that the model can better align with and attend to the relevant parts of the head entity description. Thereby the word \(w^{C_{t_{i}}}_{n}\) in the candidate tail entity description \(C_{t_{i}}\) obtains a representation aware of the head entity description with the following equation:

$$ \mathbf{E}_{w^{C_{t_{i}}}_{n}}^{d_{h}} = att\left( \mathbf{E}_{w^{C_{t_{i}}}_{n}}^{word},\ \left\{\mathbf{E}_{w^{D_{h}}_{i}}^{word}\right\}_{i=1}^{|D_{h}|}\right) $$
(9)

4.3 Text context encoder

The third component of the model is the Recurrent Neural Network (RNN) layer, which uses a Bidirectional Long Short-Term Memory network (Bi-LSTM) [8, 29] to model the contextual information. In addition, after the RNN layer, an attention mechanism between the multiple candidate tail entity descriptions is applied to obtain enhanced representations of these descriptions.

Recurrent neural network layer

In order to learn long-term dependencies [2, 11, 12] in RNNs, the Long Short-Term Memory network (LSTM) was proposed in [12]. A Bi-LSTM consists of two independent LSTMs, a forward LSTM and a backward LSTM. Using three separate Bi-LSTMs, we encode the head entity description, the question and the candidate tail entity description as follows:

$$ \mathbf{B}^{D_{h}} = \text{Bi-LSTM}\left( \left\{\left[\mathbf{E}_{w^{D_{h}}_{i}}; \mathbf{E}_{w^{D_{h}}_{i}}^{q_{r}}\right]\right\}_{i=1}^{|D_{h}|}\right) $$
(10)
$$ \mathbf{B}^{Q_{r}} = \text{Bi-LSTM}\left( \left\{\mathbf{E}_{w^{Q_{r}}_{j}}\right\}_{j=1}^{|Q_{r}|}\right) $$
(11)
$$ \widehat{\mathbf{B}}^{C_{t_{i}}} = \text{Bi-LSTM}\left( \left\{\left[\mathbf{E}_{w^{C_{t_{i}}}_{n}}; \mathbf{E}_{w^{C_{t_{i}}}_{n}}^{q_{r}}; \mathbf{E}_{w^{C_{t_{i}}}_{n}}^{d_{h}}\right]\right\}_{n=1}^{|C_{t_{i}}|}\right) $$
(12)
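The sketch below shows how the three separate Bi-LSTMs of (10)–(12) might be instantiated; the hidden size and the input dimensions simply continue the illustrative feature sizes used earlier and are not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# A minimal sketch of the three context encoders in equations (10)-(12).
hidden = 100
enc_dh = nn.LSTM(input_size=994 + 200, hidden_size=hidden, bidirectional=True, batch_first=True)
enc_qr = nn.LSTM(input_size=984,       hidden_size=hidden, bidirectional=True, batch_first=True)
enc_ct = nn.LSTM(input_size=968 + 400, hidden_size=hidden, bidirectional=True, batch_first=True)

# Each encoder consumes the per-word features concatenated with the attention-aware features.
dh_in = torch.randn(1, 512, 1194)   # head-entity description: E_w with its question-aware part
qr_in = torch.randn(1, 12, 984)     # question: E_w only
ct_in = torch.randn(1, 512, 1368)   # candidate description: E_w, question-aware and head-aware parts

B_dh, _ = enc_dh(dh_in)             # (1, 512, 2 * hidden)
B_qr, _ = enc_qr(qr_in)
B_ct, _ = enc_ct(ct_in)
```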

Attention between multiple candidate tail entity descriptions

The candidate tail entity representation \(\widehat {\mathbf {B}}^{C_{t_{i}}} \) is generated by the WSA attention, which is aware of the question and the head entity description. However, such a representation is independent of the other candidates and does not encode the hidden relationship information between the candidates. Inspired by Zhu et al. [44], we observe that there are also hidden relationships between candidate tail entities that are helpful for finding the right answer. For example, fragments of the correct candidate tail entity description may appear frequently in the descriptions of other candidate tail entities. Therefore, we design a new attention layer to explore the hidden relationships between candidates and obtain a new candidate tail entity representation \(\widehat {\widehat {\mathbf {B}}}^{C_{t_{i}}} \). We train a matrix Wcc to calculate the impact factors between the candidates, which are used as weights in the subsequent aggregation process.

$$ A_{i,j}=\widehat{\mathbf{B}}^{C_{t_{i}}} W_{cc} \widehat{\mathbf{B}}^{C_{t_{j}}} $$
(13)
$$ w_{i,j}=\frac{\exp(A_{i,j})}{\sum_{k=0}^{m}{\exp(A_{i,k})}} $$
(14)

where m is the count of candidate tail entities.

$$ \widehat{\widehat{\mathbf{B}}}^{C_{t_{i}}} = \sum\limits_{k=0}^{m}{w_{i,k}\, \widehat{\mathbf{B}}^{C_{t_{k}}}} $$
(15)

Then, we model the candidate correlations with the difference \({\widehat {\mathbf {B}}}^{C_{t_{i}}}- \widehat {\widehat {\mathbf {B}}}^{C_{t_{i}}} \), which is inspired by Chen et al. [28]. Finally, we concatenate the difference with the independent candidate representation for enhancement.

$$ \mathbf{B}^{C_{t_{i}}} =[ \widehat{\mathbf{B}}^{C_{t_{i}}};\widehat{\mathbf{B}}^{C_{t_{i}}}- \widehat{\widehat{\mathbf{B}}}^{C_{t_{i}}}] $$
(16)
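Below is a sketch of this candidate-candidate attention ((13)–(16)); we assume each candidate description has already been pooled into a single vector, a detail the equations leave implicit, and all shapes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumption: each of the m candidate descriptions is pooled into one d-dimensional vector.
m, d = 5, 200
B_hat = torch.randn(m, d)                 # \widehat{B}^{C_{t_i}} for m candidates
W_cc = nn.Parameter(torch.randn(d, d))    # trainable impact-factor matrix

A = B_hat @ W_cc @ B_hat.t()              # equation (13): pairwise impact factors A_{i,j}
w = F.softmax(A, dim=-1)                  # equation (14): normalize over candidates
B_hat_hat = w @ B_hat                     # equation (15): aggregate the candidate representations

# Equation (16): concatenate the difference to the independent representation.
B_c = torch.cat([B_hat, B_hat - B_hat_hat], dim=-1)   # (m, 2 * d)
```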

4.4 Matching prediction

We use the self-attention [41] to summarize the question sequence representation \(\mathbf {B}^{Q_{r}}\) into the final question representation \(\mathbf {R}_{Q_{r}}\). The definition of the self-attention function is as follows:

$$ att_{self}(\{\mathbf{X}_{i}\}_{i=1}^{m}) = \sum\limits_{i=1}^{m}{a_{i} \mathbf{X}_{i}} $$
(17)
$$ a_{i} = \text{softmax}\left( \mathbf{W}_{self}^{T} \mathbf{X}_{i} \right) $$
(18)

where the attention score ai indicates the importance of Xi in \(\{\mathbf {X}_{i}\}_{i=1}^{m}\).

According to the question representation \(\mathbf {R}_{Q_{r}} = att_{self}(\{\mathbf {B}^{Q_{r}}_{j}\}_{j=1}^{|Q_{r}|})\), we can get the head entity description representation \(\mathbf {R}_{D_{h}} = att(\mathbf {R}_{Q_{r}}, \{\mathbf {B}^{D_{h}}_{i}\}_{i=1}^{|D_{h}|})\), and the i-th candidate tail entity description representation \(\mathbf {R}_{C_{t_{i}}}= att(\mathbf {R}_{Q_{r}}, \{\mathbf {B}^{C_{t_{i}}}_{n}\}_{n=1}^{|C_{t_{i}}|})\).
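A small sketch of the self-attention pooling in (17)–(18) follows; the module name and dimensions are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionPool(nn.Module):
    """A sketch of att_self in equations (17)-(18): a trainable vector W_self scores
    each position, and the sequence is summarized as the weighted sum."""
    def __init__(self, dim: int):
        super().__init__()
        self.w_self = nn.Linear(dim, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, dim), e.g. the question sequence representation B^{Q_r}
        a = F.softmax(self.w_self(x).squeeze(-1), dim=-1)   # importance a_i of each X_i
        return a @ x                                        # weighted sum, shape (dim,)

# Usage sketch: summarize the question into R_{Q_r}; R_{D_h} and R_{C_{t_i}} are then
# obtained with the att(.) function of Section 4.2, using R_{Q_r} as the query.
pool = SelfAttentionPool(dim=200)
R_qr = pool(torch.randn(12, 200))
```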

Instead of simply multiplying the two vectors to obtain the score, we try a variety of functions to get the final score, mainly the following three:

$$ Score_{linear}(\mathbf{R}_{D_{h}},\mathbf{R}_{C_{t_{i}}}) = (\mathbf{R}_{D_{h}})^{T} W_{S_{1}} \mathbf{R}_{C_{t_{i}}} $$
(19)
$$ Score_{bilinear}(\mathbf{R}_{D_{h}},\mathbf{R}_{C_{t_{i}}}) = (\mathbf{R}_{D_{h}})^{T} W_{S_{2}} ((\mathbf{R}_{C_{t_{i}}})^{T} W_{S_{2}})^{T} $$
(20)
$$ Score_{trilinear}(\mathbf{R}_{D_{h}},\mathbf{R}_{C_{t_{i}}}) = W_{S_{3}} [\mathbf{R}_{D_{h}};\mathbf{R}_{C_{t_{i}}};\mathbf{R}_{D_{h}}\mathbf{R}_{C_{t_{i}}}] $$
(21)

where \(W_{S_{1}}\), \(W_{S_{2}}\) and \(W_{S_{3}}\) are transformation matrices that need to be trained. Using a score function, each candidate tail entity obtains a score si, and we compute the output of the model \(y^{\prime }\) as follows:

$$ s_{i}=Score(\mathbf{R}_{D_{h}},\mathbf{R}_{C_{t_{i}}}) $$
(22)
$$ y^{\prime}=\text{softmax}([s_{1};s_{2};\dots;s_{m}]) $$
(23)

where \(y^{\prime }_{i}\) represents the probability that the i-th candidate tail entity is correct, and Score(⋅) refers to one of Scorelinear, Scorebilinear or Scoretrilinear, whose performance is compared in the experiments.
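For concreteness, the sketch below spells out the three scoring functions (19)–(21); the shapes are illustrative, and reading the product of the two representations in (21) as an elementwise product is our assumption about the notation.

```python
import torch
import torch.nn as nn

# R_dh and R_ct are the pooled head-description and candidate-description representations.
d, k = 200, 100
R_dh, R_ct = torch.randn(d), torch.randn(d)
W_s1 = nn.Parameter(torch.randn(d, d))
W_s2 = nn.Parameter(torch.randn(d, k))
W_s3 = nn.Parameter(torch.randn(3 * d))

score_linear    = R_dh @ W_s1 @ R_ct                           # equation (19)
score_bilinear  = (R_dh @ W_s2) @ (R_ct @ W_s2)                # equation (20)
score_trilinear = W_s3 @ torch.cat([R_dh, R_ct, R_dh * R_ct])  # equation (21), elementwise product assumed

# The scores of all m candidates are then normalized with a softmax as in equation (23).
```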

To train our model, we use softmax cross entropy function as the loss function to minimize the gap between the prediction and the ground truth.

$$ L=-\sum\limits_{i=1}^{m}{y_{i}\log y^{\prime}_{i}} $$
(24)

where y is the one-hot encoding of the label of the sample, and yi and \(y^{\prime }_{i}\) represent the i-th values of y and \(y^{\prime }\), respectively.
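A minimal sketch of the loss computation in (23)–(24) for a single training example follows; the index of the correct candidate is arbitrary and chosen only for illustration.

```python
import torch

# One training example with m candidate tail entities; index 2 is assumed to be the correct one.
m = 5
scores = torch.randn(m)                  # s_1 ... s_m from the score function
y_prime = torch.softmax(scores, dim=-1)  # equation (23)
y = torch.zeros(m); y[2] = 1.0           # one-hot ground-truth label
loss = -(y * torch.log(y_prime)).sum()   # equation (24)
```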

5 Experiments

5.1 Datasets

We use the following three publicly available datasets to evaluate the performance of our approach in open-world knowledge graph completion. (1) The FB15k dataset [4], extracted from the typical large-scale KG Freebase [3]. It contains about 15,000 entities and 580,000 relational triples between entities and is often used to evaluate the effectiveness of closed-world knowledge graph completion models. (2) The FB20k dataset [39], built upon FB15k. It first removes 47 entities from FB15k whose descriptions are shorter than 3 words after preprocessing or are missing altogether, together with all triples in FB15k containing these entities, and then adds test triples with unseen entities that are selected to have rich descriptions. (3) The DBPedia50k dataset [30] for both open-world and closed-world KGC tasks, randomly sampled from the large-scale KG DBPedia [19]. It is worth mentioning that, in order to evaluate the effectiveness of our model in open-world knowledge graph completion, we extract 2000 entities from the entity set and make sure that these 2000 entities do not appear in the training set when splitting the FB15k dataset, so the test set and validation set contain 2000 entities not included in the training set. In addition, the 47 entities and the triples that contain them are also removed from FB15k. We denote the processed FB15k as FB15kopen. For FB20k and DBPedia50k, Shi et al. used a similar approach to ensure that the test set and validation set contain entities not included in the training set [30]. We evaluate our approach on FB15kopen, FB20k and DBPedia50k. Statistics of the datasets are shown in Table 1.

Table 1 Data set statistics

5.2 Experiment setting

Due to the lack of an open-world KGC task validation set on FB20k, we randomly sampled 10% of the test triples as a validation set.

Evaluation protocol

We use tail entity prediction on the test set for performance evaluation. For each test triple (h,r,t) with an open-world head entity \(h \in \mathbf {E}^{{\prime }}\), where \(\mathbf {E}^{{\prime }}\) is the entity superset, we rank all known entities t∈E, where E is the entity set, by the ranking scores computed with the KGC model. We then use three measures as our evaluation metrics: (1) Mean Rank (MR): the average rank of the correct tail entities; (2) HITS@K: the proportion of correct tail entities ranked in the top k; (3) Mean Reciprocal Rank (MRR): the mean reciprocal rank of the correct tail entities.
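For reference, the three metrics can be computed directly from the rank assigned to the correct tail entity of each test triple; the sketch below (function name and toy ranks are ours) assumes those ranks are already available.

```python
from typing import List, Tuple

def mr_hits_mrr(ranks: List[int], k: int = 10) -> Tuple[float, float, float]:
    """Compute Mean Rank, HITS@k and Mean Reciprocal Rank from the ranks of the
    correct tail entities (one rank per test triple)."""
    n = len(ranks)
    mean_rank = sum(ranks) / n
    hits_at_k = sum(1 for r in ranks if r <= k) / n
    mrr = sum(1.0 / r for r in ranks) / n
    return mean_rank, hits_at_k, mrr

# Example: ranks of the correct tail entity for five test triples.
print(mr_hits_mrr([1, 3, 12, 2, 50], k=10))
```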

Note that there may be multiple triples in the dataset that share the same head entity and relationship but have different tail entities: (h,r,t1),...,(h,r,tn). Following [4], when computing the Mean Reciprocal Rank (MRR) for a triple (h,r,ti), only the reciprocal rank of ti itself is evaluated (and not the best out of t1,...,ti,...,tn, which would produce better results). This differs from ConMask’s MRR evaluation method, which is the reason why the results in Table 3 differ from [30] (see the asterisk (*) mark).

Note also that a filtering method called target filtering is used in ConMask: when evaluating a test triple (h,r,t), a tail entity \(t^{\prime }\) is treated as a candidate only if a triple of the form (\(?, r, t^{\prime }\)) exists in the training set; otherwise it is skipped. Therefore, we also use target filtering when comparing with the ConMask model.

Parameter setting

Following ConMask, we set the maximum head entity description length |Dh|≤ 512 and the maximum candidate tail entity description length |Ct|≤ 512. We use spaCy for tokenization, part-of-speech (POS) tagging, and named entity recognition (NER). The main hyper-parameters of our model are listed in Table 2. The word embeddings are initialized with the publicly available pre-trained 200-dimensional GloVe [27] embeddings. We use Adam [17] for parameter optimization, with an initial learning rate of 0.002. A mini-batch of 32 samples is used to update the model parameters per step. In order to prevent overfitting, we apply dropout [33] to the input embeddings and the Bi-LSTM outputs with a drop rate of 0.4. We implement our model in PyTorch (Footnote 3).
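The optimization setup above can be summarized in a few lines of PyTorch; this is only a hedged sketch of the configuration, and the placeholder network stands in for the full MIA model, which is not reproduced here.

```python
import torch
import torch.nn as nn

# Placeholder network: NOT the MIA model, only a stand-in so the optimizer can be constructed.
model = nn.Sequential(nn.Linear(200, 100), nn.ReLU(), nn.Linear(100, 1))

optimizer = torch.optim.Adam(model.parameters(), lr=0.002)  # Adam with initial learning rate 0.002
dropout = nn.Dropout(p=0.4)   # applied to input embeddings and Bi-LSTM outputs
batch_size = 32               # mini-batch size per parameter update
```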

Table 2 Hyper-parameter settings

5.3 Open-world tail entity prediction

We compare our MIA model with other open-world KGC models; the experimental results are shown in Table 3. For a fair comparison, all the results are evaluated using target filtering.

The results for the Target Filtering Baseline, DKRL and ConMask were obtained with the implementation provided by [30]. The Target Filtering Baseline assigns random scores to all entities that pass the target filtering. DKRL uses a two-layer convolutional neural network (CNN) over the entity descriptions. ConMask uses relationship-dependent content masking and a fully convolutional neural network (FCN) to extract word-level target entity embeddings from entity descriptions, and then combines them with other textual features (entity names) to compare against the candidate tail entities and produce a ranked list. Besides, we test the effect of the model without the attention mechanism between the candidate tail entities, which is marked as “Baseline” in Table 3.

Table 3 Open-world Tail Entity prediction results on FB20k and DBPedia50k

As can be seen from Table 3, our MIA model significantly outperforms ConMask in HITS@K, MR, and MRR by a large margin. At the same time, we also find that the MIA model performs better on the DBPedia50k dataset than on the FB20k dataset, because the entity descriptions in DBPedia50k are richer than those in FB20k: the average entity description length is 454 words in DBPedia50k versus 147 words in FB20k. Besides, we can see that the performance on FB15kopen is weaker than on the other datasets, since its test entities do not exist in the training set, which exacerbates the data sparsity problem of FB15k. In addition, Table 3 shows that the “Baseline” is slightly less effective than the full model, demonstrating the effectiveness of the attention mechanism between candidate tail entities.

5.4 Closed-world entity prediction

Our model can also work on closed-world KGC, since open-world KGC adds additional constraints to closed-world KGC. The dataset used in this part, denoted as FB15kclosed, also removes the 47 entities and the triples that contain them. Unlike FB15kopen, all entities in the test set of FB15kclosed must exist in the training set. As shown in Tables 4 and 5, we compare the results of our model on closed-world KGC with several models including TransR [21], Jointly [40], SimplE [16], TuckER [1] and RotatE [34]. The latest methods, including SimplE, TuckER and RotatE, use a filtering method [4] when calculating HITS: each candidate tail entity filters out the other correct candidate tail entities when calculating its rank. Such a filtering method results in HITS values higher than those without filtering. So, for these latest methods, we compare HITS with the filtering method in Table 5, and for the traditional methods, including TransR, Jointly and ConMask, we compare HITS without the filtering method in Table 4. We find that our model outperforms most of these baseline methods, given that we enhance the interactions between the model inputs with attention mechanisms. For HITS@1, our method is weaker than TuckER and RotatE, probably because they can better represent some of the special relations in FB15k, such as symmetric/antisymmetric relations. In addition, the results of ConMask and our model are better on FB15kclosed than on FB15kopen as a whole. This is probably because the entities in FB15kclosed are more consistently distributed between the test set and the training set than those in FB15kopen.

Table 4 Effect comparison of closed-world KGC without the filtering method
Table 5 Effect comparison of closed-world KGC with the filtering method

5.5 Different score functions

In order to better calculate the score between the representation of the head entity and the representation of the tail entity, we evaluate the scoring functions in (19)–(21). The experimental results on DBPedia50k are shown in Table 6, where Scoremultiply represents multiplying the two vectors directly. We can see that Scorelinear, which calculates the score by multiplying the two vectors with an intermediate matrix, has the best performance, given that it can better simulate the interaction between the two representations. In order to further illustrate the performance of these four score functions, we show how the training loss evolves over epochs in Figure 4. We can see that Scoremultiply has the highest final loss, probably because simple vector multiplication cannot capture the hidden interaction between the two representations. Besides, Scorebilinear and Scoretrilinear converge slowly and have a higher final loss than Scorelinear due to their complexity. In contrast, Scorelinear performs best, because it can capture the hidden interaction between the two representations through a transformation matrix without high complexity.

Fig. 4 The loss of different score functions during training

Table 6 Effect comparison of different score functions

5.6 Interaction between multiple candidates

To investigate how candidate tail entities interact with each other through the attention mechanism, we visualize the weight matrix between multiple candidates in a running example at different numbers of epochs. As shown in Figure 5, the n cells in the i-th row represent the weights of the n candidates when calculating the hidden representation of the i-th candidate, where a darker color indicates a higher weight. The merged attention weights over multiple candidates help to aggregate useful information from the candidates into the hidden representation of the current candidate. Taking c1 as an example, the five colors in the first row of the three matrices in Figure 5 represent the weights of c1-c5 when calculating the hidden representation of c1. We find that as the number of epochs increases, the model gives higher weights to c1, c2 and c5, probably because they are thought to be more helpful in enhancing the hidden representation of c1.

Fig. 5 Attention weight matrix visualization between multiple candidates

It may be difficult to understand the weight matrices if we only focus on the candidate descriptions. But when we look at the description of the head entity, we can see that the model emphasizes the candidate related to “music genre”, especially “pop music”, when aggregating the information of the candidate tail entities. This may benefit from the previous attention layer, which generates representations of the candidate tail entity descriptions that are aware of the head entity description. The comparison with “Baseline” in Table 3 also demonstrates the effectiveness of the attention between multiple candidate tail entities.

5.7 Ablation study

We carry out model ablations to further demonstrate the effectiveness of the proposed model. First, we conduct an ablation analysis on the input word representation, which consists of several components: Part-Of-Speech (POS) embedding, Named-Entity-Recognition (NER) embedding, Handcrafted Features embedding, etc. The experimental results on DBPedia50k are shown in Table 7; we find that all the input word representation components contribute to the performance of our MIA model. This suggests that it is useful to incorporate various features into the word representation. We also remove the multiple interaction attention from the model. The results in Table 7 show a significant performance drop of 1.5%, which indicates that the multiple interaction attention is effective in extracting the most relevant parts of the entity text description for different relationships.

Table 7 Ablations on several model components

6 Conclusions and future work

This paper introduces an open-world KGC model called MIA that uses a word-level attention mechanism to simulate the interaction between the head entity description, the head entity name, the relationship name and multiple candidate tail entity descriptions. In addition, we use additional textual features of the head entity descriptions to enhance the head entity representation and apply an attention mechanism between the candidate tail entities to enhance their representations. Besides, we try different scoring functions to improve the convergence of the model. Experiments on three datasets show that the MIA model achieves significant improvements on the open-world KGC task compared to state-of-the-art models. However, MIA relies heavily on the richness of the entity descriptions, and the tail entity can be effectively predicted only when the necessary information related to the relationship is expressed in the entity description. In future work, we plan to introduce more external knowledge into MIA to make it more robust.