
1 Introduction

The knowledge graph (KG), also known as a semantic network, represents a network of real-world entities and the relations between them. It has been widely used in various applications, such as language representation learning, question answering, and recommender systems.

To date, existing KGs are usually constructed automatically by machine learning algorithms, but many hidden relations remain unobserved, so these KGs are often incomplete. In view of this situation, the link prediction task in KGs aims to predict the missing entities or relations, and is also called knowledge inference [1] or knowledge completion [2]. Real-world KGs are usually dynamic: new entities and new types of relations are observed over time. We refer to such KGs with dynamic changes over time as temporal knowledge graphs (TKGs). Figure 1 shows an example of a temporal knowledge graph that evolves from the year 1998 to 2013. As can be seen from the observed facts in Fig. 1, Tencent was founded by Pony Ma in 1998 and formed multiple relations with other entities from 2010 to 2013, so the semantics of the entity Tencent shifts over time. Hence, modeling temporal information in KGs is crucial to understanding how knowledge evolves over time. In this paper, we study the problem of learning TKG embeddings for link prediction.

Fig. 1. An example of the evolution of a temporal knowledge graph.

However, TKG embedding for link prediction is a challenging task for two reasons: (1) it is difficult to model the strong temporal dependencies in a TKG; (2) there are latent factors that affect how the graph evolves.

In this paper, we propose an approach, called TKGE, to learn evolutional embeddings of the entities and relations in a TKG. We formulate the link prediction task as a conditional probability problem covering both entity prediction and relation prediction. The proposed approach is designed to address the two challenges highlighted above. Specifically, we use a self-attention mechanism to model the TKG from both the static structural information and the dynamic evolution of the graph, so as to better capture the complex temporal dependencies. We further employ multi-head self-attention to model the TKG from multiple aspects, which allows TKGE to remain efficient while generating accurate evolutional embeddings. Extensive results on six datasets demonstrate the effectiveness and rationality of the proposed model.

2 Related Work

The link prediction task on TKGs aims to predict missing quadruples. It is further divided into two subtasks: entity prediction and relation prediction. Traditional research on KG link prediction mainly focuses on static KGs, where facts do not change over time, and there is comparatively little research on TKGs. However, the temporal information in a TKG is very helpful for capturing the dynamic trends of facts, since interaction data in the real world is not static.

In general, TKG link prediction is categorized into two settings: interpolation and extrapolation. Given a TKG over the time interval \([t_0,t_n]\), interpolation aims to predict missing facts at a timestamp t such that \(t_0 \le t\le t_n\), whereas extrapolation aims to predict new facts at a timestamp t such that \(t\ge t_n\). For interpolation, DE-SimplE [3] draws inspiration from diachronic word embeddings and develops a diachronic embedding function that generates a hidden representation for an entity at any given time. However, such methods are of limited help for completing the KG at future timestamps.

For extrapolation, existing methods rely on various techniques such as temporal point process frameworks [4, 5], GCN-based methods [6], and deep recurrent models [7, 8]. Mei et al. propose the N-SM-MPP model [4] for link prediction, using the Hawkes process to capture dynamic temporal information. In addition, DyRep [5] treats representation learning as a latent mediation process. However, these two methods are better suited to modeling TKGs in continuous time. To handle the multi-relational, directed graph structure of each KG, RE-GCN [6] utilizes a GCN, a powerful model for graph-structured data, to characterize structural dependencies. Jin et al. develop an autoregressive framework [7], called RE-NET, which uses a recurrent event encoder and a neighborhood aggregator to model dynamic temporal information and concurrent events within the same timestamp. Despite their effectiveness, these methods have limitations because they do not consider a particular property of TKGs: there exist latent factors that affect the network evolution.

3 Problem Formulation

We first introduce the notations used in this paper and then formalize the TKG representation learning problem to be addressed.

Notations. A TKG is defined as a sequence of static KGs with timestamps, \(G=\{G_1,\cdots ,G_n\}\), where n is the number of timestamps. Each KG \(G_t\) is a multi-relational, directed graph that contains all the facts co-occurring at timestamp t; we define it as \(G_t=\{V,R,\mathcal {E}_t\}\), where V denotes the set of entities, R denotes the set of relations, and \(\mathcal {E}_t\) denotes the set of facts at timestamp t. Any fact in \(\mathcal {E}_t\) can be represented as a quadruple (subject entity, relation, object entity, timestamp), denoted by (s, r, o, t), where \(s,o\in V\) and \(r \in R\).

Problem Definition. The TKG representation learning task aims to learn latent representations \(\textbf{h}_{o,t} \in \mathbb {R}^{F'}\) and \(\textbf{r}_t \in \mathbb {R}^{F'}\) for each entity \(o \in V\) and relation \(r \in R\) at each time \(t \in \{1, 2, \ldots , n\}\), such that \(\textbf{h}_{o,t}\) preserves both the graph structure and the dynamic temporal information. Formally, the problem can be defined as:

Input: A sequence of static KGs with timestamps, \(G=\{G_1,\cdots ,G_n\}\), entity embedding matrices \(\textbf{H}_t\) and relation embedding matrices \(\textbf{R}_t\).

Output: A map function \(f: V\cup R \rightarrow \mathbb {R}^{F'}\), which maps each entity and relation in G to an \(F'\)-dimensional embedding vector.
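To make the formulation concrete, the following minimal sketch shows one way a TKG can be stored as a sequence of timestamped fact sets. The entity and relation names besides Tencent and Pony Ma (taken from Fig. 1), as well as the integer ids, are hypothetical examples rather than data from the benchmarks used later.

```python
# A minimal, illustrative representation of a TKG as a sequence of static snapshots.
# Entities and relations are mapped to integer ids; each snapshot G_t is a set of
# (subject, relation, object) triples, and a quadruple is recovered by pairing a
# triple with its snapshot index t.

entity2id = {"Tencent": 0, "Pony_Ma": 1, "WeChat": 2}        # "WeChat" is hypothetical
relation2id = {"founded_by": 0, "develops": 1}               # relation names are hypothetical

# G = {G_1, ..., G_n}: one set of facts per timestamp.
snapshots = [
    {(0, 0, 1)},          # G_1: (Tencent, founded_by, Pony_Ma, t=1)
    {(0, 1, 2)},          # G_2: (Tencent, develops, WeChat, t=2)
]

# Enumerate all quadruples (s, r, o, t).
quadruples = [(s, r, o, t) for t, facts in enumerate(snapshots, start=1)
              for (s, r, o) in facts]
print(quadruples)
```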

4 Methodology

In this section, we introduce our proposed method, TKGE, which consists of two components: a structural self-attention module that captures the static structural information of the KG at each timestamp t, and a temporal self-attention module that captures the evolution of the dynamic KG over time. The overall framework is shown in Fig. 2. First, the structural self-attention module uses self-attention [9, 10] to aggregate the local static structural information of each entity at each timestamp t, generating structural embeddings for the entities. Then, by introducing position embeddings, the temporal self-attention module captures the evolution of the KG across timestamps. Finally, based on the learned representations of entities and relations, various scoring functions can be used to perform prediction tasks at future timestamps. It is worth mentioning that, intuitively, some latent factors in the real world can affect the evolution of the graph. This is evident in citation networks, where papers in different research fields may accumulate citations at significantly varying rates. We therefore introduce a multi-head self-attention mechanism [11, 12] in both the structural and temporal self-attention modules, which not only characterizes the evolution of the KG from different aspects but also reduces the prediction deviation and improves the stability of the model.

Fig. 2. The overall architecture of the TKGE model.

4.1 Structural Self-attention

To capture the static structural information of the KG at each timestamp t, this component realizes the mapping from the sequence of KGs \(G=\{G_1,\cdots ,G_n\}\) to a sequence of entity embedding matrices \(\{{\textbf {H}}_1,\cdots ,{\textbf {H}}_n\}\). The initial entity embedding matrix is obtained by random initialization.

Specifically, the structural self-attention mechanism computes the structural representation of an entity by learning the importance of each relevant entity to the target entity and assigning a different weight to each of them: \(\textbf{h}_{o,t} =\sigma (\frac{1}{c_o}\sum _{(s,r),\exists (s,r,o)\in \mathcal {E}_t}\alpha _{s,o}W^s(\textbf{h}_{s,t}+\textbf{r}_t))\), where \(\textbf{h}_{s,t}\) and \(\textbf{r}_t\) are the embeddings of entity s and relation r at timestamp t, respectively, \(c_o\) is the in-degree of the object entity o, and \(\sigma (\cdot )\) is a nonlinear activation function. In particular, \(\textbf{h}_{s,t}+\textbf{r}_t\) reflects the translation property between the entity and the relation in a triple. The weight \(\alpha _{s,o}\) is computed as follows:

$$\begin{aligned} \textbf{e}_{s,o}=\sigma (A_{s,o}\cdot \mathbf {\alpha } ^T[W^s \textbf{h}_{o,t}||W^s\textbf{h}_{s,t}]),\forall (s,r,o)\in \mathcal {E}_t, \end{aligned}$$
(1)
$$\begin{aligned} \alpha _{s,o}=\frac{\exp (e_{s,o})}{\sum _{(s',r'),\exists (s',r',o)\in \mathcal {E}_t}\exp (e_{s',o})}, \end{aligned}$$
(2)

where \(A_{s,o}\) represents the weight of the link between the subject entity and the object entity at the current timestamp t; in this paper, the number of edge occurrences is used as the weight. \(\mathbf {\alpha } \in \mathbb {R}^{2D}\) and \(W^s\in \mathbb {R}^{F\times D}\) denote the weight vector and the parameter matrix, respectively, || is the concatenation operation, and \(\sigma (\cdot )\) denotes a non-linear activation function. Notably, this module is composed of multiple stacked structural self-attention layers; by stacking multiple layers, our method can further capture the high-order relations between entities to generate the final structural embedding of each entity at each timestamp.
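The following numpy sketch illustrates a single-head structural self-attention layer as described above (Eqs. (1)-(2)). The LeakyReLU and tanh activations, the self-loop fallback for entities without incoming edges, and all parameter shapes are assumptions made for readability, not the exact implementation.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def structural_attention_layer(H, R, facts, A, W, a):
    """One structural self-attention layer (a sketch of Eqs. (1)-(2)).

    H     : (num_entities, F)  entity embeddings at timestamp t
    R     : (num_relations, F) relation embeddings at timestamp t
    facts : list of (s, r, o) triples observed at timestamp t
    A     : (num_entities, num_entities) edge-weight matrix (e.g. edge counts)
    W     : (F, D) shared projection matrix W^s
    a     : (2*D,) attention weight vector
    """
    num_entities, D = H.shape[0], W.shape[1]
    H_new = np.zeros((num_entities, D))

    # Unnormalised attention logits e_{s,o} for every observed edge, Eq. (1).
    logits = {}
    for (s, r, o) in facts:
        z = np.concatenate([H[o] @ W, H[s] @ W])          # [W^s h_o || W^s h_s]
        logits[(s, r, o)] = leaky_relu(A[s, o] * (a @ z))

    # Softmax over the incoming edges of each object entity (Eq. (2)), then aggregate.
    for o in range(num_entities):
        incoming = [(s, r) for (s, r, oo) in facts if oo == o]
        if not incoming:
            H_new[o] = np.tanh(H[o] @ W)                  # self-loop fallback (an assumption)
            continue
        e = np.array([logits[(s, r, o)] for (s, r) in incoming])
        alpha = np.exp(e - e.max()); alpha /= alpha.sum()
        msgs = np.stack([(H[s] + R[r]) @ W for (s, r) in incoming])  # W^s (h_s + r_t)
        H_new[o] = np.tanh((alpha[:, None] * msgs).sum(0) / len(incoming))  # 1/c_o scaling
    return H_new
```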

4.2 Temporal Self-attention

To capture the dynamic temporal information in the TKG, we first utilize position embeddings \(\{\textbf{p}^1,\cdots ,\textbf{p}^{T}\},\textbf{p}^t\in \mathbb {R}^F\), to encode the temporal order of the static KGs. The temporal self-attention module then takes as input a sequence of embeddings for a particular entity o at different timestamps, computed as \(\textbf{h}_{o,t}+\textbf{p}^t\), and returns a sequence of temporal entity embeddings. Meanwhile, this component utilizes a GRU to realize the mapping from the sequence of KGs \(G=\{G_1,\cdots ,G_n\}\) to a sequence of relation embedding matrices \(\{ {\textbf {R}}_1,\cdots ,{\textbf {R}}_n\}\); the initial relation embedding matrix is obtained by random initialization. Specifically, the input and output for each entity o are denoted by \(\{ \textbf{x}_{o,t_1},\cdots ,\textbf{x}_{o,t_n}\}, \textbf{x}_{o,t}\in \mathbb {R}^{D'}\) and \(\{ \textbf{h}_{o,t_1},\cdots ,\textbf{h}_{o,t_n}\}, \textbf{h}_{o,t}\in \mathbb {R}^{F'}\), where n, \(D'\) and \(F'\) denote the number of timestamps and the dimensions of the input and output vectors, respectively. In practice, the input and output embeddings of entity o at different timestamps are packed into matrices \(\textbf{X}_o \in \mathbb {R}^{T\times D'}\) and \(\textbf{H}_o \in \mathbb {R}^{T\times F'}\). The output embedding of the entity is computed as \(\textbf{H}_{o}=\beta _{o}(\textbf{X}_o\textbf{W}_v)\), where \(\beta _{o} \in \mathbb {R}^{T \times T}\) is the attention weight matrix. This computation first projects the queries, keys, and values into different spaces through the learned matrices \(\textbf{W}_q \in \mathbb {R}^{D'\times F'},\textbf{W}_k \in \mathbb {R}^{D'\times F'},\textbf{W}_v \in \mathbb {R}^{D'\times F'}\), and then uses scaled dot-product attention to calculate the attention scores and weights:

$$\begin{aligned} e_o^{ij}=\frac{((\textbf{X}_o\textbf{W}_q)(\textbf{X}_o\textbf{W}_k)^\top )_{ij}}{\sqrt{F'}}+M_{ij}, \end{aligned}$$
(3)
$$\begin{aligned} \beta _o^{ij}=\frac{\exp (e_o^{ij})}{\sum _{k=1}^{n}\exp (e_o^{ik})}, \end{aligned}$$
(4)

where \(\textbf{M} \in \mathbb {R}^{T\times T}\) is the mask matrix used to encode the autoregressive property, that is, information from earlier timestamps is used to describe the information at later timestamps. If \(i \ge j\), \(M_{ij}=0\); otherwise, \(M_{ij}=-\infty \).
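As a concrete illustration of Eqs. (3)-(4), the sketch below computes the masked scaled dot-product attention for a single entity and a single head. The toy shapes and random inputs in the usage example are purely illustrative; the multi-head extension and the exact index convention of the mask are simplified to the autoregressive behaviour described above.

```python
import numpy as np

def temporal_self_attention(X, Wq, Wk, Wv):
    """Masked scaled dot-product attention over the timestamps of one entity.

    X  : (T, D') input embeddings h_{o,t} + p^t for timestamps 1..T
    Wq, Wk, Wv : (D', F') projection matrices for queries, keys and values
    Returns H_o : (T, F') temporal embeddings of entity o.
    """
    T, Fp = X.shape[0], Wq.shape[1]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv

    # Autoregressive mask: timestamp i may only attend to timestamps j <= i.
    M = np.where(np.tril(np.ones((T, T))) > 0, 0.0, -np.inf)

    E = Q @ K.T / np.sqrt(Fp) + M                        # Eq. (3)
    beta = np.exp(E - E.max(axis=1, keepdims=True))
    beta /= beta.sum(axis=1, keepdims=True)              # Eq. (4), row-wise softmax
    return beta @ V                                      # H_o = beta_o (X_o W_v)

# Usage with random toy inputs (shapes only, values are illustrative):
rng = np.random.default_rng(0)
T, D_in, F_out = 5, 8, 8
X = rng.normal(size=(T, D_in))                           # h_{o,t} + p^t
H_o = temporal_self_attention(X, *(rng.normal(size=(D_in, F_out)) for _ in range(3)))
print(H_o.shape)                                         # (5, 8)
```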

Intuitively, a relation embedding contains information about the entities involved in the corresponding facts, so the embedding of relation r at timestamp t is affected by the entities related to r at timestamp t and by its own embedding at timestamp \(t-1\), where the related entities are denoted by \(E_{r,t}=\{i\mid (i,r,o,t)\in \mathcal {E}_t \text { or } (s,r,i,t) \in \mathcal {E}_t\}\). Inspired by [6], we apply average pooling to the embedding matrix of the entities related to relation r at timestamp \(t-1\), \(\textbf{H}_{t-1,E_{r,t}}\), and compute the relation embedding as \(\textbf{r}_{t}'=[pooling(\textbf{H}_{t-1,E_{r,t}})||\textbf{r}]\), where \({\textbf {r}}\) is the static embedding vector of relation r. In particular, when relation r does not occur at timestamp t, \(\textbf{r}_{t}'=0\). Then, based on the relation embedding matrix \(\textbf{R}_t'\) and the matrix \(\textbf{R}_{t-1}\) at timestamp \(t-1\), we utilize a GRU to obtain the updated relation embedding matrix \(\textbf{R}_t\).
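A minimal numpy sketch of this relation update for a single relation is given below; the GRU cell is written out explicitly, and the weight shapes, gate layout, and parameter initialization are assumptions rather than the exact implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h_prev, params):
    """A plain GRU cell: x is the pooled input r'_t, h_prev is r_{t-1}."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(x @ Wz + h_prev @ Uz)                    # update gate
    r = sigmoid(x @ Wr + h_prev @ Ur)                    # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h_prev) @ Uh)        # candidate state
    return (1 - z) * h_prev + z * h_tilde

def update_relation_embedding(H_prev, related_entities, r_static, r_prev, params):
    """Sketch of r'_t = [pooling(H_{t-1, E_{r,t}}) || r], followed by one GRU step.

    H_prev           : (num_entities, F) entity embedding matrix at timestamp t-1
    related_entities : list of entity ids in E_{r,t}
    r_static         : (F,) static embedding r of the relation
    r_prev           : (F,) relation embedding at timestamp t-1
    params           : GRU weight matrices with shapes (2F, F) and (F, F)
    """
    F = r_static.shape[0]
    if len(related_entities) == 0:
        x = np.zeros(2 * F)                              # r'_t = 0 when r is absent at t
    else:
        pooled = np.mean(H_prev[related_entities], axis=0)   # average pooling over E_{r,t}
        x = np.concatenate([pooled, r_static])               # r'_t = [pooling(...) || r]
    return gru_cell(x, r_prev, params)
```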

After obtaining the evolutional embeddings of entities and relations, we perform link prediction in a probabilistic way. Considering the good results achieved by GCN-based models in KG link prediction, we choose ConvTransE [13] as the decoder. The entity conditional probability vector is computed as follows:

$$\begin{aligned} \textbf{p}(o|s,r,\textbf{H}_t,\textbf{R}_t)=\sigma (\textbf{H}_tConvTransE(\textbf{s}_t,\textbf{r}_t)), \end{aligned}$$
(5)

where \(\sigma (\cdot )\) denotes the sigmoid function. Similarly, we can get the relational conditional probability vector according to the following formula:

$$\begin{aligned} \textbf{p}(r|s,o,\textbf{H}_t,\textbf{R}_t)=\sigma (\textbf{R}_tConvTransE(\textbf{s}_t,\textbf{o}_t)), \end{aligned}$$
(6)

where \(\textbf{s}_t,\textbf{r}_t\) and \(\textbf{o}_t\) are the corresponding elements in the entity and relation embedding matrices \(\textbf{H}_t\) and \(\textbf{R}_t\), respectively.
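The sketch below shows the general shape of this decoding step: the evolutional embeddings of the two query elements are combined by a 1-D convolutional decoder and scored against all candidate embeddings, as in Eq. (5). The decoder here only mimics the structure of ConvTransE [13]; the kernel handling, padding, activation, and the absence of dropout or normalization are illustrative assumptions, not the exact architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv_decoder(a, b, kernels, W_fc):
    """A ConvTransE-style decoder sketch: stack two embeddings as a 2-channel
    sequence, apply 1-D convolutions, flatten, and project back to dimension F."""
    F = a.shape[0]
    x = np.stack([a, b])                                 # (2, F)
    feats = []
    for k in kernels:                                    # each k: (2, width) kernel
        width = k.shape[1]
        pad = width // 2
        xp = np.pad(x, ((0, 0), (pad, pad)))
        feats.append(np.array([np.sum(xp[:, i:i + width] * k) for i in range(F)]))
    z = np.maximum(np.concatenate(feats), 0.0)           # ReLU over flattened feature maps
    return z @ W_fc                                      # (F,) decoded query vector

def entity_probabilities(H_t, s_t, r_t, kernels, W_fc):
    """p(o | s, r, H_t, R_t) = sigmoid(H_t . ConvTransE(s_t, r_t)), Eq. (5)."""
    q = conv_decoder(s_t, r_t, kernels, W_fc)
    return sigmoid(H_t @ q)
```

The relation probability vector of Eq. (6) follows the same pattern, with \(\textbf{s}_t\) and \(\textbf{o}_t\) as decoder inputs and \(\textbf{R}_t\) as the scoring matrix.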

4.3 Parameter Learning

This paper treats the entity and relation prediction tasks as multi-class classification problems, where each class corresponds to an entity or a relation, and \(\textbf{y}_{t+1}^o\in \mathbb {R}^{|V|}\) or \(\textbf{y}_{t+1}^r\in \mathbb {R}^{|R|}\) denotes the label vector of the entity or relation prediction task at timestamp \(t+1\). An element with value 1 indicates that the corresponding fact occurs. The corresponding loss functions are as follows:

$$\begin{aligned} L^o=\sum _{t=0}^{n-1}\sum _{(s,r,o,t+1)\in \mathcal {E}_{t+1}}\sum _{i=0}^{|V|-1}y_{t+1,i}^o \log p_i(o|s,r,\textbf{H}_t,\mathbf {R_t}), \end{aligned}$$
(7)
$$\begin{aligned} L^r=\sum _{t=0}^{n-1}\sum _{(s,r,o,t+1)\in \mathcal {E}_{t+1}}\sum _{i=0}^{|R|-1}y_{t+1,i}^r \log p_i(r|s,o,\textbf{H}_t,\mathbf {R_t}), \end{aligned}$$
(8)

where n is the number of timestamps in the training set, and \( y_{t+1,i}^o\) and \(y_{t+1,i}^r \) are the i-th elements of \(\textbf{y}_{t+1}^o \) and \(\textbf{y}_{t+1}^r \), respectively.
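For completeness, the small sketch below shows how the entity-prediction loss of Eq. (7) can be accumulated over the facts at one future timestamp; it is written in the standard cross-entropy (negative log-likelihood) form, and the probability function is assumed to wrap a decoder such as the one sketched above.

```python
import numpy as np

def entity_prediction_loss(facts_next, prob_fn, num_entities, eps=1e-12):
    """Cross-entropy loss over the facts in E_{t+1} (a sketch of Eq. (7)).

    facts_next : iterable of (s, r, o) facts observed at timestamp t+1
    prob_fn    : function (s, r) -> probability vector over all entities,
                 e.g. a wrapper around entity_probabilities above
    """
    loss = 0.0
    for (s, r, o) in facts_next:
        p = prob_fn(s, r)
        y = np.zeros(num_entities); y[o] = 1.0           # one-hot label y^o_{t+1}
        loss -= np.sum(y * np.log(p + eps))              # negative log-likelihood term
    return loss
```

The relation-prediction loss of Eq. (8) is obtained analogously by summing over relation classes instead of entity classes.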

4.4 Discussion

To verify the efficiency of our algorithm, this section analyzes the time complexity of the model. First, the self-attention mechanism mainly involves three steps: similarity calculation, softmax, and weighted summation; its time complexity is \(O(n^2d)\), where n is the maximum length of the sequence and d is the embedding dimension. In addition, we use a GRU component to compute the relation embeddings, where the time complexity of the pooling operation is O(|R|D), with D the maximum number of entities related to a relation r at a timestamp t and |R| the number of relations. Therefore, the total time complexity is \(O (n^2d + |R|D)\).

5 Experiments

To evaluate the entity and relation embeddings learned by TKGE, we employ them to address the link prediction problem in TKG. Link prediction in TKG can be regarded as a classification task, which is mainly divided into entity prediction and relation prediction. Through experiments, we aim to answer the following research questions:

Table 1. Descriptive statistics of datasets.

RQ1: How does TKGE perform compared with state-of-the-art temporal knowledge graph embedding methods?

RQ2: Is the construction of structural and temporal self-attention component helpful to learn more desirable representations for temporal knowledge graph?

RQ3: How do the key hyper-parameters affect the performance of TKGE?

In what follows, we first introduce the experimental settings and then answer the above research questions in turn.

5.1 Experimental Settings

Dataset. We use six commonly used TKG datasets: ICEWS18, ICEWS14, ICEWS05-15, WIKI, YAGO, and GDELT. The first three datasets are derived from data captured and processed by the Integrated Crisis Early Warning System (ICEWS). The data in GDELT come from various international news sources, such as the New York Times and the Washington Post. The statistics of the datasets are summarized in Table 1.

Evaluation Protocols. To evaluate the performance of the link prediction task, we process the datasets in the same way as RE-GCN. For the ICEWS14 and ICEWS05-15 datasets, we randomly select 80% of the instances as the training set, 10% as the validation set, and the remaining 10% as the test set. The details of the dataset splits are shown in Table 1. Following previous research, we adopt two evaluation metrics commonly used for link prediction, namely Mean Reciprocal Rank (MRR) and Hits@K.
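For reference, a minimal sketch of the two metrics is given below; it assumes the ranks of the ground-truth answers have already been computed, and leaves the exact filtering protocol to the evaluation pipeline of the compared baselines.

```python
import numpy as np

def mrr_and_hits(ranks, ks=(1, 3, 10)):
    """Compute MRR and Hits@K from the ranks of the ground-truth answers."""
    ranks = np.asarray(ranks, dtype=float)
    metrics = {"MRR": float(np.mean(1.0 / ranks))}
    for k in ks:
        metrics[f"Hits@{k}"] = float(np.mean(ranks <= k))
    return metrics

# Example: ranks of the correct entity among all candidates for four test queries.
print(mrr_and_hits([1, 4, 2, 15]))
```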

Baselines. We compare TKGE with two types of baselines: TKG link prediction models under the interpolation setting and TKG link prediction models under the extrapolation setting. For interpolation, we select HyTE [14], TTransE [15], and TA-DistMult [16] as comparison methods. For extrapolation, we select RGCRN [17], CyGNet [18], RE-NET [7], and RE-GCN [6]. In particular, RGCRN extends GCRN, which was originally designed for homogeneous graphs, by replacing the GCN with an R-GCN.

Parameter Settings. We implement TKGE in TensorFlow and run the experiments on a Tesla V100 GPU. The number of training epochs is limited to 200. We adopt the Adam optimizer with a learning rate of \(10^{-3}\). For all models, we set the embedding dimension to 200 and the batch size to 256 for a fair comparison.
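The hyper-parameters listed above can be summarised in a single configuration sketch; fields not stated in the text (e.g. dropout or weight decay) are deliberately omitted and left to framework defaults.

```python
# Training configuration as stated in the text; all other settings are framework defaults.
config = {
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "max_epochs": 200,
    "embedding_dim": 200,
    "batch_size": 256,
}
```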

Table 2. Entity prediction performance on different datasets.

5.2 Performance Comparison (RQ1)

Entity Prediction. The results of all the comparison methods on this task are presented in Table 2. First, TKGE is superior to the TKG link prediction models under the interpolation setting, such as HyTE, TTransE, and TA-DistMult, because TKGE additionally captures the static structural information of the KG at each timestamp t and the temporal evolution of the TKG. This improvement demonstrates that the evolutional embeddings generated by TKGE are significantly more accurate. Second, the temporal models for the extrapolation setting, such as CyGNet, RE-NET, and RE-GCN, perform well, which verifies the effect of repetitive patterns and direct neighbors on the entity prediction task. Among them, RE-GCN is the strongest baseline: compared with CyGNet and RE-NET, it considers both the static structural information at each timestamp t and the dynamic temporal information over time, and can therefore obtain satisfactory evolutional representations. Third, our proposed method TKGE performs very well on most datasets. It is noteworthy that the results on GDELT are not as strong as on the other datasets. Further analysis suggests that this is because GDELT contains many abstract entities that do not refer to concrete individuals (e.g., teacher and school). Moreover, as the time interval increases, the performance gap between the last two rows in Table 2 becomes larger. For WIKI and YAGO, the long time interval means that the data at each timestamp t contains more static structural information, which helps improve the performance of the entity prediction task. Since TKGE simultaneously models the static structural and dynamic temporal information, it is effective for link prediction tasks in TKGs.

Relation Prediction.

Due to space limitations, we only compare TKGE with some typical TKG methods and report the MRR performance. Specifically, we select RGCRN and RE-GCN because they can be used for the relation prediction task directly. As shown in Table 3, the proposed TKGE consistently outperforms the other baselines on all datasets. This superiority demonstrates that our structural and temporal self-attention modules generate more accurate evolutional entity and relation representations. Compared with the entity prediction task, the performance gap on the relation prediction task is smaller, probably because the number of relations is limited, making relation prediction easier than entity prediction. For example, since WIKI and YAGO contain only 24 and 10 relations, respectively, the results on these two datasets are superior to those on the other datasets.

Table 3. Relation prediction performance on different datasets.
Table 4. TKGE with and without structural self-attention or temporal self-attention.

5.3 Utility of Structural and Temporal Self-attention (RQ2)

To demonstrate the effectiveness of our structural and temporal self-attention modules, we compare TKGE with its variants TKGE-NS and TKGE-NT, which remove the structural self-attention module and the temporal self-attention module, respectively. As shown in Table 4, the performance of TKGE-NS and TKGE-NT is lower than that of TKGE on all datasets. This result verifies the validity of the structural and temporal self-attention modules.

5.4 Hyper-Parameter Studies (RQ3)

Due to space limitations, we only analyze the effect of the number of heads h in the multi-head structural and temporal self-attention, since it plays a crucial role in modeling the TKG. For fairness, we keep all other parameters fixed while varying h, and analyze the entity prediction task on the ICEWS14 and YAGO datasets individually. As shown in Table 5, multi-head self-attention effectively improves the performance of TKGE. In addition, as h increases, the performance of the entity prediction task first increases and then decreases after reaching a local optimum; in this paper, the best number of attention heads is 8. In general, multi-head self-attention, which models the TKG from multiple perspectives, is an effective technique for the entity prediction task.

Table 5. Impact of hyper-parameter h on entity prediction.

6 Conclusions

We have presented TKGE, a novel approach for embedding TKGs. It jointly models the static structural and temporal information when learning evolutional representations for entities and relations. Extensive experiments on six real-world datasets demonstrate the effectiveness and rationality of TKGE. In this work, we have regarded a TKG as a sequence of static KG snapshots and thus only explored the discrete-time approach. Since the discrete-time approach characterizes temporal information at a relatively coarse level, the learned embeddings may lose some information between snapshots. To address this issue, we plan to extend TKGE with continuous-time approaches [19,20,21] to integrate more fine-grained temporal information.