
1 Introduction

A great amount of the data generated in daily life takes the form of graphs, such as social networks, financial transactions, and literature citations. Researchers have adopted the triple form \(\left( subject,relation,object\right) \) to represent semantic information in data and construct large-scale knowledge graphs (KGs) such as DBpedia, FreeBase, and WordNet [25]. However, KGs are usually incomplete due to data sparsity, which makes knowledge graph completion (KGC) a priority task. Knowledge graph representation learning expresses the underlying semantic information by mapping triples into continuous low-dimensional vector spaces, which has proved to be an efficient method for KGC [11].

Static KG representation learning models ignore temporal information, which can lead to inaccurate semantic representations. As depicted in Fig. 1 (a), there are three relations Praise or endorse, Make optimistic comment, and Criticize or denounce between Barack Obama and Iran; such knowledge

Fig. 1. Example of the temporal knowledge graph

can cause confusion when temporal information is neglected, since these three relations conflict with one another. Figure 1 (b) depicts a sample of a temporal knowledge graph (TKG): the relations between Barack Obama and Iran become clear once temporal information is added. We can also observe that Iran has an Express intent to cooperate relation with China, a Consult relation with Afghanistan, and a Host a visit relation with Syria; these three relation types have different impacts on Iran, and the topology of the countries and relations around Iran also determines the character of Iran. Therefore, effectively modeling the topological features of a KG is essential for KG representation learning. Capturing temporal features in a TKG is also crucial. As shown in Fig. 1 (b), the relation Make a visit between Barack Obama and South Korea occurred at time 2014-08-16, whereas Barack Obama has a Make optimistic comment relation with Iran at time 2014-12-29; because of the long time interval between the two events, the former has less influence on the latter, which also reveals that relative time typically carries more significant temporal characteristics than absolute time. Our model aims to capture the topological and temporal features of a TKG well, in contrast to static KG representation learning models, which ignore temporal information and process the TKG in a static manner, resulting in incomplete and inaccurate expression of semantic information.

In recent years, TKG representation learning, which incorporates the corresponding temporal features when expressing the semantic information in data, has received extensive attention from both academia and industry [7]. However, most current TKG representation learning models face several challenges. (1) Sensible time encoding: since the TKG topology is dynamic, entities should have different features at different times. Moreover, the time encoding should satisfy the inherent properties of time; for instance, relative time usually carries more meaningful information than absolute time: when a user buys a product on the internet, how long the user stays on a certain product page matters more than the order in which products were browsed. Previous models mostly used simple feed-forward or recurrent neural networks to capture temporal features, which lack a solid theoretical foundation. (2) Modeling relations appropriately: distinct relations around an entity should have different influences on it, yet most existing models do not consider relation attention. Topological information is incomplete when different relations are treated with the same attention weights. (3) Effectively modeling structure: most TKG representation learning models extend static KG models, focusing on the inherent characteristics of quadruples and treating quadruples independently while ignoring structural information. A model should also capture the correlations between intrinsic entity features and temporal features when modeling structure, which remains challenging.

A TKG attention network, named TKGAT, is proposed to address these common problems in existing TKG representation learning models. A time encoding function based on Bochner's theorem [23] is adopted to capture temporal features; it is well suited to modeling the properties of relative time and has a deep theoretical foundation. The weights of different relation types are computed by an attention network to reflect their relevance to the central entity. The self-attention mechanism [19] has proved its powerful ability in various tasks; we replace position encoding with time encoding and apply decoupled attention [6] to optimize self-attention, which incorporates more extensive knowledge graph features and effectively captures the correlations between entities and time. Our contributions in this paper can be summarized as follows.

(1) We propose a novel temporal knowledge graph representation learning model, TKGAT, which encodes temporal information based on Bochner's theorem and uses attention networks to capture the weights of different relations, in order to model relational information efficiently and improve performance.

(2) We design a decoupled attention approach that separates structure and time encoding to optimize the traditional self-attention mechanism, and combine it with graph neural networks to efficiently capture the correlations between entity and temporal features.

(3) The proposed model outperforms the baseline methods on three public datasets, further demonstrating its effectiveness.

The rest of this paper is organized as follows. Section 2 presents related works. We introduce preliminaries in Sect. 3. We describe the proposed model in detail in Sect. 4. Section 5 reports the experimental results, and we conclude in Sect. 6.

2 Related Work

In this section, the traditional static KG representation learning models and the TKG representation learning models are introduced.

2.1 Static Knowledge Graph Representation Learning

At present, most existing knowledge graph representation learning models are designed for static KGs and can be classified into three categories. The first category is translation-based models, which require the head and tail entities to satisfy the translation constraint of the relation and measure the plausibility of a triple by the Euclidean distance between the translated head and tail entity vectors. TransE [1], TransH [20], and TransR [13] are the most representative models; owing to the simplicity and efficiency of TransE, a series of subsequent works have extended it. The second category is semantic matching based models, which evaluate the plausibility of a fact by matching the underlying semantic information of entities and relations in the vector space. RESCAL [15], DistMult [24], ComplEx [18], and SimplE [9] are the simplest and most widely used models. The third category is neural network-based models, which exploit the strength of neural networks in feature extraction and non-linear fitting to model KG features; representative models include ConvE [3], ConvKB [14], and RGCN [16]. However, all these models ignore temporal information and fail to reflect the changing nature of the real world, resulting in lower accuracy on TKGs.

2.2 Temporal Knowledge Graph Representation Learning

In recent years, temporal knowledge graph representation learning has gradually become a hot research topic. Most existing models primarily extend static KG representation learning to TKGs. TTransE [7] adds temporal information to the score function of TransE and makes it satisfy a time-based translation constraint. HyTE [2] extends TransH by projecting entities and relations onto time-specific hyperplanes to embed temporal information. TA-TransE [4] represents the relation type and temporal information as a character sequence, then uses an LSTM to learn time-aware representations of relation types. TComplEx [10] extends ComplEx and treats the score of each quadruple as a fourth-order tensor decomposition. TeRo [21] borrows ideas from TransE and RotatE [17]; it defines the temporal evolution of entity embeddings as a rotation and regards relations as translations. ATiSE [22] incorporates temporal information into entity and relation representations using additive time series decomposition and represents temporal uncertainty with a multi-dimensional Gaussian distribution. Inspired by diachronic word embedding, DE-SimplE [5] incorporates temporal information into diachronic entity embeddings and can model various relation patterns. Compared to our model, these models fail to capture the rich structural information and the correlations between entity and temporal features. Another line of work on TKG representation learning employs neural networks: RE-NET [8] adopts an R-GCN based aggregator and a recurrent event encoder to model historical information, and RE-GCN [12] learns evolutional representations of entities and relations by capturing structural dependencies and sequential patterns. However, these models focus on the TKGC extrapolation task, i.e., inferring future facts in a sequence, which is fundamentally different from our work.

3 Preliminaries

In this section, we present the preliminaries of our work, including the definition of temporal knowledge graph and graph neural network.

3.1 Temporal Knowledge Graph

In this paper, we represent a temporal knowledge graph as \(\mathcal {G}=\left\{ \left( s , r , o , t \right) \right\} \subseteq \mathcal {V} \times \mathcal {R} \times \mathcal {V} \times \mathcal {T}\), where \(\mathcal {V}\), \(\mathcal {R}\) and \(\mathcal {T}\) denote the sets of nodes, edges, and timestamps, respectively. Temporal knowledge graph completion (TKGC) addresses the incompleteness of TKGs. Assume the set of all true facts is \(\mathcal {F}\subseteq \mathcal {V}\times \mathcal {R}\times \mathcal {V}\times \mathcal {T}\); a TKG is a subset of this set because of its incompleteness, i.e., \(\mathcal {G}\subseteq \mathcal {F}\). TKGC is the reasoning from \(\mathcal {G}\) to \(\mathcal {F}\). According to the time range, TKGC has two settings, interpolation and extrapolation. Given a temporal knowledge graph \(\mathcal {G}\) with timestamps t ranging from \(t_{1}\) to \(t_{T}\), the interpolation setting predicts missing facts with \(t_{1}< t <t_{T}\); in contrast, the extrapolation setting predicts missing facts with \(t >t_{T}\), i.e., predicting future facts based on past ones. More formally, the purpose of TKGC is to predict either the subject in a given query \(\left( ?,r,o,t \right) \) or the object in a given query \(\left( s,r,?,t \right) \). Our work focuses on TKGC in the interpolation setting.

3.2 Graph Neural Network

Graph neural networks (GNNs) enjoy several advantages, such as the ability to effectively handle non-Euclidean data, which has made them very successful in processing graph data. The core idea of a GNN is the message propagation mechanism, i.e., the features of a central node are constructed by aggregating information from its neighbors. To obtain the features of a central node i through multiple GNN layers, each layer performs the following two steps: (1) Message Propagation, which gets messages from all neighbors of node i; (2) Message Aggregation, which aggregates the messages from all neighbor nodes and then combines them with the features of node i from the previous layer to obtain the features at the current layer. The above processes are defined as follows:

$$\begin{aligned} \textbf{h}^{l}_{\mathcal {N}_{i}^{k} }\leftarrow AGG \left( \left\{ \textbf{h}^{l-1}_{j},\forall j\in \mathcal {N}_{i}^{k} \right\} \right) \end{aligned}$$
(1)
$$\begin{aligned} \textbf{h}^{l}_{i}\leftarrow \sigma \left( \textbf{W}^{l}\left( \textbf{h}^{l-1}_{i}\left| \right| \textbf{h}^{l}_{\mathcal {N}_{i}^{k} } \right) \right) \end{aligned}$$
(2)

Steps (1) and (2) correspond to Eqs. 1 and 2, respectively, where \(\mathcal {N}_{i}^{k}\) denotes the k neighbors of node i, \(\textbf{h}_{i}^{l}\) denotes the hidden state of node i at the l-th layer, and AGG is a function that aggregates the features of neighbors, which can be implemented using long short-term memory (LSTM), self-attention mechanisms, etc. In this paper, we use a decoupled attention approach to implement AGG, which is able to capture more extensive features. Representative GNN models include graph convolutional networks (GCN) and graph attention networks (GAT), both of which assign weights to neighbors explicitly or implicitly when aggregating features.
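As a minimal illustration of Eqs. 1 and 2, the following PyTorch-style sketch implements one generic GNN layer with a simple mean aggregator; the class and parameter names are illustrative, and the mean aggregator is only a placeholder for the decoupled attention used later in this paper.

```python
import torch
import torch.nn as nn

class SimpleGNNLayer(nn.Module):
    """Illustrative GNN layer in the sense of Eqs. (1)-(2)."""

    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(2 * dim, dim)  # W^l applied to [h_i^{l-1} || h_N]

    def forward(self, h_i, h_neighbors):
        # h_i: (dim,) previous-layer state of node i
        # h_neighbors: (k, dim) previous-layer states of its k neighbors
        h_agg = h_neighbors.mean(dim=0)                # AGG: a mean aggregator here
        out = self.W(torch.cat([h_i, h_agg], dim=-1))  # combine with h_i^{l-1}
        return torch.relu(out)                         # sigma
```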

Fig. 2. The architecture of the TKGAT model. In this figure, we evaluate the truth of the quadruple (Barack Obama, Make optimistic comment, Iran, 2014-12-29). First, we find the temporal neighbors whose interaction time with Barack Obama is before 2014-12-29; the relation encoding module combines the vectors of the subject Barack Obama, the relations, and the temporal neighbors to calculate attention weights and integrates the relation features into the temporal neighbors. Second, the time encoding function based on Bochner's Theorem is applied to capture relative time features. Third, the decoupled attention module learns the vector of Barack Obama by capturing structural and temporal features; an analogous approach is used for Iran. Finally, the static KG embedding model ConvKB is adopted to score the triple with integrated temporal features.

4 Our Approach

Figure 2 depicts the architecture of our model. Overall, the model follows an encoder-decoder architecture. The encoder module maps entities into a continuous low-dimensional vector space and incorporates structural and temporal features simultaneously. Since relations are usually independent of temporal information, the temporal features are integrated into the entity vectors in our model. Because different relation types have different impacts on the subject, the encoder module first integrates the relation features into the objects according to type attention weights, then employs a decoupled attention method to learn the interactions between subjects and objects in terms of structure and time. Finally, the quadruple \(\left( s,r,o,t \right) \) is converted into the triple \(\left( s_{t},r,o_{t} \right) \), so the decoder module can directly evaluate triples using static KG embedding methods.

4.1 Encoded Relation Information

Assume that there are \(\left| \mathcal {R} \right| \) relation types and \(\left| \mathcal {V} \right| \) entities in the temporal knowledge graph \(\mathcal {G}\). The initial vectors of all entities and relations are represented as sets \(\textrm{E} =\left\{ \textbf{e}_{i} \right\} _{i=1}^{\left| \mathcal {V} \right| }\) and \(\textrm{R} =\left\{ \textbf{r}_{i} \right\} _{i=1}^{\left| \mathcal {R} \right| }\) respectively, where \(\textbf{e}_{i} \in \mathbb {R}^{d_{e}}\) is the initial vector of the i-th entity and \(\textbf{r}_{i} \in \mathbb {R}^{d_{r}}\) is the initial vector of the i-th relation; \(d_{e}\) and \(d_{r}\) denote the initial vector dimensions of entities and relations, respectively. Given a quadruple \(\left( s , r , o ,t \right) \), according to the inherent characteristics of time, i.e., information about future events cannot influence the present moment, the temporal neighbors of subject \( s \) are denoted as \(\mathcal {N}_{s}^{t_{k}<t}=\left\{ (r_{i},o_{j},t_{k})|\left( s,r_{i},o_{j},t_{k} \right) \in \mathcal {G},t_{k}< t \right\} \). Since different relation types have different effects on the subject, we combine the subject vector \(\textbf{e}_{{s}}\), relation vector \(\textbf{r}_{{i}}\), and object vector \(\textbf{e}_{{j}}\) and calculate the attention weights with the \(\textrm{softmax}\) function. Finally, the relation feature is incorporated into the corresponding object vector, where the attention weights are calculated as follows.

$$\begin{aligned} \textbf{u}_{r_{i},o_{j}} =\textbf{W}_{1} \left( \textbf{e}_{{s}}\left| \right| \textbf{r}_{{i}}\left| \right| \textbf{e}_{{j}} \right) \end{aligned}$$
(3)
$$\begin{aligned} \alpha _{i,j}=\textrm{softmax} \left( \textbf{u}_{r_{i},o_{j}}\right) = \frac{\textrm{exp} \left( \sigma \left( \textbf{p} \cdot \textbf{u}_{r_{i},o_{j}} \right) \right) }{ {\textstyle \sum _{(r_{m},o_{n},t_{k})\in \mathcal {N}_{s}^{t_{k}<t}}^{}} \textrm{exp} \left( \sigma \left( \textbf{p}\cdot \textbf{u}_{r_{m},o_{n}} \right) \right) } \end{aligned}$$
(4)

where \(\textbf{W}_{1}\in \mathbb {R}^{d_{e}\times (2d_{e}+d_{r})}\) and \(\textbf{p}\in \mathbb {R}^{d_{e}}\) are parameters learned during training, and \(\sigma \) is the \(\textrm{LeakyReLU}\) activation function. After obtaining the attention weights \(\alpha _{i,j}\) of the relation types, the temporal neighbor vectors that incorporate relation features are calculated as follows:

$$\begin{aligned} \textbf{x}_{i,j}=\alpha _{i,j}\textbf{W}_{2}\left( \textbf{r}_{{i}}\left| \right| \textbf{e}_{{j}} \right) \end{aligned}$$
(5)

where \(\textbf{W}_{2}\in \mathbb {R}^{d_{e}\times (d_{e}+d_{r})}\) is a model parameter matrix.
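The relation encoding of Eqs. 3-5 can be sketched in PyTorch as follows; this is a minimal sketch under assumed tensor shapes, and the class name and batching convention are illustrative rather than a definitive implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationAttention(nn.Module):
    """Sketch of Eqs. (3)-(5): weight each temporal neighbor (r_i, o_j) by its
    relevance to the subject s and fold the relation feature into the object."""

    def __init__(self, d_e, d_r):
        super().__init__()
        self.W1 = nn.Linear(2 * d_e + d_r, d_e, bias=False)  # W_1
        self.W2 = nn.Linear(d_e + d_r, d_e, bias=False)      # W_2
        self.p = nn.Parameter(torch.randn(d_e))               # attention vector p

    def forward(self, e_s, r, e_o):
        # e_s: (d_e,) subject; r: (k, d_r) relations; e_o: (k, d_e) objects
        k = r.size(0)
        u = self.W1(torch.cat([e_s.expand(k, -1), r, e_o], dim=-1))    # Eq. (3)
        alpha = F.softmax(F.leaky_relu(u @ self.p), dim=0)             # Eq. (4)
        x = alpha.unsqueeze(-1) * self.W2(torch.cat([r, e_o], dim=-1)) # Eq. (5)
        return x  # (k, d_e): neighbor vectors with relation features folded in
```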

4.2 Encoded Temporal Information

Having obtained the entity vectors that incorporate relation information, our aim is to further integrate temporal information. Since the structure of a TKG is no longer static and entity features may change over time, the time encoding should reflect temporal characteristics, e.g., events that happened long ago have less impact on current events. We employ the time encoding function proposed in [23], which maps from the time domain to a continuously differentiable functional domain; it is based on Bochner's Theorem and is compatible with gradient descent during model training. We denote it as \(\varPhi (t)\) and define it as follows:

$$\begin{aligned} t\rightarrow \varPhi (t):= \sqrt{\frac{1}{d_{t}}}\left[ \cos (\omega _{1}t),\sin (\omega _{1}t),\ldots ,\cos (\omega _{d_{t}}t),\sin (\omega _{d_{t}}t) \right] \end{aligned}$$
(6)

where \(\omega =\left[ \omega _{1},...,\omega _{d_{t}}\right] ^{ T }\) are learnable parameters.
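A possible implementation of this time encoding is sketched below; the cos and sin components are simply concatenated rather than interleaved, and the random initialization of the frequencies is an assumption.

```python
import math
import torch
import torch.nn as nn

class TimeEncoder(nn.Module):
    """Sketch of the Bochner-based time encoding of Eq. (6)."""

    def __init__(self, d_t):
        super().__init__()
        self.omega = nn.Parameter(torch.randn(d_t))  # learnable frequencies

    def forward(self, delta_t):
        # delta_t: (k,) relative times t - t_k of the temporal neighbors
        angles = delta_t.unsqueeze(-1) * self.omega                       # (k, d_t)
        phi = torch.cat([torch.cos(angles), torch.sin(angles)], dim=-1)   # (k, 2*d_t)
        return phi / math.sqrt(self.omega.numel())                        # sqrt(1/d_t) scaling
```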

4.3 Encoded Structural Information

Since the topology of the TKG contains important information, we borrow the core idea of GNNs, i.e., the message propagation mechanism, to capture structural information. To aggregate messages from neighbors with attention weights, we adopt a decoupled attention method based on the self-attention mechanism.

Given a quadruple \(\left( s , r , o ,t \right) \), the temporal neighbors of subject \( s \) are \(\mathcal {N}_{s}^{t_{k}<t}\). At time t, the vector of subject \( s \) at the l-th layer is represented as \(\textbf{h}^{l}\); when \(l=1\), \(\textbf{h}^{l}=\textbf{e}_{s}\), i.e., the initial vector of \( s \). The object connected to subject s through relation \( r _{j}\) is \( o _{i}\), whose vector at the l-th layer is \(\textbf{h}^{l}_{i}\); when \(l=1\), \(\textbf{h}^{l}_{i}=\textbf{x}_{i,j}\), which is obtained by the relation encoding module. Since relative time, rather than absolute time, usually reveals the critical temporal information, we directly encode the relative times \(\left\{ t-t_{1},t-t_{2},...,t-t_{k} \right\} \) with the time encoding function, obtaining the temporal encodings of the neighbors \(\left\{ \varPhi (t-t_{1}),\varPhi (t-t_{2}),...,\varPhi (t-t_{k}) \right\} \), where k denotes the number of neighbors of s at time t.

The traditional self-attention mechanism is designed for sequential structures: it adds or combines the two vectors representing the content and position of a token to construct its feature. However, this approach cannot effectively capture the correlation between content and position features. Inspired by DeBERTa [6], we replace position encoding with time encoding and calculate the weights by a decoupled attention method.

The query vector at layer l is \(\textbf{q}=\textbf{W}_{q}\textbf{h}^{l-1}\), where \(\textbf{W}_{q}\in \mathbb {R}^{d_{h}\times d_{e}}\) is a model parameter matrix. The vectors of the temporal neighbors and their temporal encodings are stacked into matrices \(\textbf{Z}_{E}\) and \(\textbf{Z}_{T}\) respectively, which at the \(l-1\) layer are:

$$\begin{aligned} \textbf{Z}_{E}=\left[ \textbf{h}_{1}^{(l-1)},\textbf{h}_{2}^{(l-1)},...,\textbf{h}_{k}^{(l-1)} \right] \in \mathbb {R}^{d_{e} \times k } \end{aligned}$$
(7)
$$\begin{aligned} \textbf{Z}_{T}=\left[ \varPhi (t-t_{1}),\varPhi ( t-t_{2}),...,\varPhi (t-t_{k}) \right] \in \mathbb {R}^{d_{t} \times k} \end{aligned}$$
(8)

Applying linear transformation on matrices \(\textbf{Z}_{E}\) and \(\textbf{Z}_{T}\):

$$\begin{aligned} \textbf{K}=\textbf{W}_{K}\textbf{Z}_{E}, \textbf{P}=\textbf{W}_{T}\textbf{Z}_{T}, \textbf{V}=\textbf{W}_{V}\textbf{Z}_{E} \end{aligned}$$
(9)

where \(\textbf{W}_{K},\textbf{W}_{V} \in \mathbb {R}^{d_{h}\times d_{e}}\) and \(\textbf{W}_{T} \in \mathbb {R}^{d_{h} \times d_{t}}\) are model parameters. The attention matrix obtained by the decoupled attention approach is as follows:

$$\begin{aligned} \widetilde{\textbf{A}}_{0,j}= \left[ \textbf{q} \right] ^{\textsf{T}}\textbf{K}_{j} + \left[ \textbf{q} \right] ^{\textsf{T}}\textbf{P}_{j} \end{aligned}$$
(10)

where the attention matrix \(\widetilde{\textbf{A}}\in \mathbb {R}^{1\times k}\), and \(\textbf{K}_{j}\) and \(\textbf{P}_{j}\) denote the j-th columns of \(\textbf{K}\) and \(\textbf{P}\), respectively. In the attention computation, \(\left[ \textbf{q} \right] ^{\textsf{T}}\textbf{K}_{j}\) captures the correlation between subject \( s \) and the j-th neighbor object in terms of structure, and \(\left[ \textbf{q} \right] ^{\textsf{T}}\textbf{P}_{j}\) captures their correlation in terms of time; the final attention matrix is obtained by adding the two. We apply the \(\textrm{softmax}\) function to obtain the weights, and the final feature vector of the temporal neighbors is then obtained as a weighted sum.

$$\begin{aligned} \textbf{h}^{l}_{\mathcal {N}_{s}^{< t}}=\textrm{softmax}\left( \frac{\widetilde{\textbf{A}}_{0,j}}{\sqrt{2d_{h}} } \right) \textbf{V} \end{aligned}$$
(11)
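Putting Eqs. 7-11 together, the decoupled attention over the temporal neighbors of one subject can be sketched as follows; the single-head, single-subject formulation and the parameter names are illustrative simplifications, not a definitive implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledAttention(nn.Module):
    """Sketch of Eqs. (7)-(11): structural (q.K_j) and temporal (q.P_j)
    correlations are scored separately and summed before the softmax."""

    def __init__(self, d_e, d_t, d_h):
        super().__init__()
        self.W_q = nn.Linear(d_e, d_h, bias=False)
        self.W_K = nn.Linear(d_e, d_h, bias=False)
        self.W_V = nn.Linear(d_e, d_h, bias=False)
        self.W_T = nn.Linear(d_t, d_h, bias=False)
        self.d_h = d_h

    def forward(self, h_s, Z_E, Z_T):
        # h_s: (d_e,) subject state h^{l-1}
        # Z_E: (k, d_e) neighbor states; Z_T: (k, d_t) relative-time encodings
        q = self.W_q(h_s)                                       # (d_h,)
        K, V, P = self.W_K(Z_E), self.W_V(Z_E), self.W_T(Z_T)   # each (k, d_h)
        scores = K @ q + P @ q                                  # Eq. (10), shape (k,)
        alpha = F.softmax(scores / math.sqrt(2 * self.d_h), dim=0)  # Eq. (11)
        return alpha @ V                                        # (d_h,) weighted sum
```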

To maintain the original features of the subject s, we concatenate the final feature vector of the temporal neighbors with the hidden vector of s at the \((l-1)\)-th layer, then pass the result through a multilayer perceptron to capture non-linear interactions.

$$\begin{aligned} \textbf{h}^{l}=\textrm{MLP}\left( \textbf{h}^{l}_{\mathcal {N}_{s}^{< t}}\left| \right| \textbf{h}^{l-1}\right) =\textrm{ReLU}\left( \left[ \textbf{h}^{l}_{\mathcal {N}_{s}^{< t}}\left| \right| \textbf{h}^{l-1} \right] \textbf{W}^{l}_{0}+\textbf{b}^{l}_{0} \right) \textbf{W}^{l}_{1}+\textbf{b}^{l}_{1} \end{aligned}$$
(12)
$$\begin{aligned} \textbf{W}^{l}_{0}\in \mathbb {R}^{2d_{h}\times d_{h}}, \textbf{b}^{l}_{0} \in \mathbb {R}^{d_{h}}, \textbf{W}^{l}_{1}\in \mathbb {R}^{d_{h} \times d_{o}}, \textbf{b}^{l}_{1} \in \mathbb {R}^{d_{o}} \end{aligned}$$

where \(\textbf{W}^{l}_{0}\), \(\textbf{b}^{l}_{0}\), \(\textbf{W}^{l}_{1}\) and \(\textbf{b}^{l}_{1}\) are model parameters and \(d_{o}\) denotes the dimension of the final output vector. The proposed model can also be easily extended to a multi-head setting, which improves performance and stability. Suppose there are m different \(\textrm{head}\)s with \(\textrm{head}^{\left( i\right) }=\textbf{h}^{l\left( i\right) }_{\mathcal {N}_{s}^{t_{k}<t}}\); we concatenate the m \(\textrm{head}\) outputs with the vector of s and carry out the same procedure as in Eq. 12.

$$\begin{aligned} \widetilde{\textbf{h}}^{l}=\textrm{MLP}\left( \textrm{head}^{\left( 1 \right) }\left| \right| \cdots \left| \right| \textrm{head}^{\left( m \right) }\left| \right| \textbf{h}^{l-1}\right) \end{aligned}$$
(13)
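The multi-head combination of Eqs. 12 and 13 then amounts to concatenating the head outputs with the previous-layer state of s and passing the result through a two-layer MLP; the sketch below uses assumed, illustrative dimension names.

```python
import torch
import torch.nn as nn

class CombineHeads(nn.Module):
    """Sketch of Eqs. (12)-(13): [head^(1) || ... || head^(m) || h^{l-1}] -> MLP."""

    def __init__(self, m, d_h, d_in, d_o):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(m * d_h + d_in, d_h),  # W_0^l, b_0^l
            nn.ReLU(),
            nn.Linear(d_h, d_o),             # W_1^l, b_1^l
        )

    def forward(self, heads, h_prev):
        # heads: list of m tensors of shape (d_h,); h_prev: (d_in,) state h^{l-1}
        return self.mlp(torch.cat(heads + [h_prev], dim=-1))
```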

4.4 Decoder and Training

Given a quadruple \(\eta =\left( s , r , o ,t \right) \), the encoder module of TKGAT provides vectors with temporal information \(\left( \tilde{\textbf{s}_{t}},\textbf{r},\tilde{\textbf{o}_{t}} \right) \). Since the temporal information has been incorporated into the entity vectors, a static KG score function can be used to evaluate the triples. Among existing methods, TKGAT adopts ConvKB as the decoder, whose score function is defined as follows:

$$\begin{aligned} f \left( \eta \right) =\left( \mathop {||}\limits _{n=1}^{\left| \varOmega \right| }g\left( \left[ \textbf{s}_{t},\textbf{r},\textbf{ o}_{t} \right] *\omega ^{n} \right) \right) \textbf{W}_{c} \end{aligned}$$
(14)

where \(\varOmega \) denotes the set of convolution kernels, \(\omega ^{n}\in \varOmega \) denotes the n-th convolution kernel, and \(\textbf{W}_{c}\) denotes the parameter matrix of the linear transformation; \(\varOmega \) and \(\textbf{W}_{c}\) are shared across all triples during training. The activation function \(g(\cdot )\) is \(\textrm{ReLU}\) and \(*\) denotes the convolution operation. The output vectors of the \(\left| \varOmega \right| \) convolution operations are concatenated into a single vector, and a linear transformation is applied to obtain the final score.
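A ConvKB-style decoder in the spirit of Eq. 14 can be sketched as follows; the batching, kernel shape, and module names are assumptions rather than a faithful reproduction of the original ConvKB code.

```python
import torch
import torch.nn as nn

class ConvKBDecoder(nn.Module):
    """Sketch of the ConvKB-style score of Eq. (14): 1x3 filters slide over the
    stacked [s_t, r, o_t] embeddings; the concatenated feature maps are
    projected to a scalar score."""

    def __init__(self, d, num_filters):
        super().__init__()
        self.conv = nn.Conv2d(1, num_filters, kernel_size=(1, 3))  # Omega
        self.W_c = nn.Linear(num_filters * d, 1, bias=False)       # W_c

    def forward(self, s_t, r, o_t):
        # s_t, r, o_t: (batch, d) embeddings with temporal features folded in
        x = torch.stack([s_t, r, o_t], dim=2).unsqueeze(1)  # (batch, 1, d, 3)
        feat = torch.relu(self.conv(x))                     # (batch, |Omega|, d, 1)
        return self.W_c(feat.flatten(1))                    # (batch, 1) scores
```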

During model training, the parameters are learned using gradient-based optimization in mini-batches. For each quadruple \(\eta =\left( s,r,o,t\right) \in \mathcal {G} \), we sample a set of negative entities \(S=\left\{ o'|(s,r,o',t) \not \in \mathcal {G}\right\} \), and the cross-entropy loss function is used to train the model, defined as follows:

$$\begin{aligned} \mathcal {L}=-\sum _{\eta \in \mathcal {G}}\log \frac{\textrm{exp}\left( f (s,r,o,t)\right) }{\textrm{exp}\left( f (s,r,o,t)\right) + {\textstyle \sum _{o'\in S}}\textrm{exp}\left( f (s,r,o',t)\right) } \end{aligned}$$
(15)

Note that, without loss of generality, the above loss and negative samples are given for one query direction; the other direction is handled analogously. Algorithm 1 shows the training process in detail.
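In code, the objective of Eq. 15 reduces to a standard cross-entropy over the true entity and its sampled negatives, roughly as in the sketch below (tensor shapes and names are assumptions):

```python
import torch
import torch.nn.functional as F

def quadruple_loss(score_pos, score_neg):
    """Sketch of Eq. (15): the true entity is the correct class among itself
    and the sampled negatives."""
    # score_pos: (batch, 1) scores f(s, r, o, t) of true quadruples
    # score_neg: (batch, n_neg) scores f(s, r, o', t) of corrupted quadruples
    logits = torch.cat([score_pos, score_neg], dim=1)  # (batch, 1 + n_neg)
    target = torch.zeros(logits.size(0), dtype=torch.long,
                         device=logits.device)         # true entity at index 0
    return F.cross_entropy(logits, target)
```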

Algorithm 1. TKGAT training algorithm

5 Experiments

In this section, to verify the effectiveness of the proposed model, we conduct experiments on link prediction tasks on three public datasets. We first introduce the experimental setup, including datasets, evaluation metrics, baselines, and implementation, and then analyze the experimental results. Furthermore, we perform several ablation studies to demonstrate the effectiveness of each main component of the proposed model.

5.1 Experimental Setup

Datasets. We evaluate the proposed model on link prediction tasks using three public TKG datasets; their statistics are summarized in Table 1. For the Integrated Crisis Early Warning System (ICEWS) dataset, we use two subsets provided by [4]: ICEWS14, corresponding to facts in 2014, and ICEWS05-15, corresponding to facts between 2005 and 2015. For the Global Database of Events, Language, and Tone (GDELT) dataset, we use a subset corresponding to facts from 1 April 2015 to 31 March 2016; each fact has a corresponding timestamp. We use the same training, validation, and testing splits as provided by [5].

Evaluation Metrics. For each quadruple \(\left( s,r,o,t\right) \in \mathcal {D}_{test}\), where \(\mathcal {D}_{test}\) represents the test dataset, we generate two queries: \(\left( s,r,?,t\right) \) and \(\left( ?,r,o,t\right) \). For the first query, the model evaluates all entities and obtains scores \( f (s,r,o',t)\), \( \forall o'\in \mathcal {E}\), with an analogous approach used for the second query. According to the final scores, the rank of the given quadruple is obtained, and we report mean reciprocal rank \(\left( MRR\right) \) which is defined as:

$$\begin{aligned} MRR=\frac{1}{2\left| \mathcal {D}_{test} \right| } \sum _{\eta \in \mathcal {D}_{test}}\left( \frac{1}{rank\left( o|s,r,t \right) }+\frac{1}{rank\left( s|r,o,t \right) } \right) \end{aligned}$$
(16)

where \(\eta =\left( s,r,o,t \right) \) and \(\left| \mathcal {D}_{test} \right| \) denotes the size of the test set. We also report Hits@1, Hits@3, and Hits@10, where Hits@k is the percentage of correct quadruples ranked within the k highest-scoring predictions; Hits@k is defined as:

$$\begin{aligned} Hits@k=\frac{1}{2\left| \mathcal {D}_{test} \right| } \sum _{\eta \in \mathcal {D}_{test}}\left( \mathbb {I}_{\left( rank\left( o|s,r,t \right) \le k \right) }+\mathbb {I}_{\left( rank\left( s|r,o,t \right) \le k \right) }\right) \end{aligned}$$
(17)

where \(\mathbb {I}_{\left( \cdot \right) }\) is an indicator function, \(\mathbb {I}_{\left( cond\right) }\) is 1 if cond holds and 0 otherwise.
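These metrics can be computed directly from the ranked scores, as in the following sketch; the caller is assumed to average the returned values over both query directions, and the tensor names are illustrative.

```python
import torch

def mrr_hits(scores, true_idx, ks=(1, 3, 10)):
    """Sketch of Eqs. (16)-(17) for one query direction."""
    # scores: (num_queries, |V|) scores of every candidate entity
    # true_idx: (num_queries,) index of the correct entity for each query
    true_scores = scores.gather(1, true_idx.unsqueeze(1))    # (num_queries, 1)
    ranks = (scores > true_scores).sum(dim=1).float() + 1.0  # 1-based ranks
    metrics = {"MRR": (1.0 / ranks).mean().item()}
    for k in ks:
        metrics[f"Hits@{k}"] = (ranks <= k).float().mean().item()
    return metrics
```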

Table 1. Statistics of datasets.

Baselines. We test the performance of the proposed model against a variety of strong baselines, including static KG representation learning models and TKG representation learning models. All static models are applied without considering the time information in the input; they include TransE [1], DistMult [24], ComplEx [18], and SimplE [9]. The TKG representation learning baselines include TTransE [7], HyTE [2], TA-TransE [4], DE-SimplE [5], ATiSE [22], and TeRo [21]. As TGAT [23] is specifically designed to handle dynamic network graphs rather than TKGs, we do not compare with it.

Table 2. Evaluation results on link prediction. The best results are in bold and the second-best results are underlined.

Implementation. We implemented our model and the baselines in PyTorch and conducted the experiments on an NVIDIA Tesla V100 GPU. The vector dimensions of entities, relations, and time are fixed to 128. We experimented with different score functions for training and finally chose ConvKB as the decoder. The number of sampled temporal neighbors is set to 20 for the ICEWS14 and ICEWS05-15 datasets and 50 for the GDELT dataset. Although information from multi-hop neighbors can in principle be aggregated by our model, only information from 2-hop neighbors is aggregated to speed up training. The number of attention heads and negative samples is set to 4 and 200, respectively, and the Adam optimizer is applied to train the model with a learning rate of 0.001 for all datasets.

5.2 Results and Analysis

Table 2 shows the experimental results of link prediction on the ICEWS14, ICEWS05-15, and GDELT datasets. From the results, we can observe that the static KG representation learning models fall behind the TKG models in most cases. The primary reason is that static KG models learn only one representation for each entity or relation, without taking temporal information into account.

The results also demonstrate the state-of-the-art performance of our approach on link prediction tasks. As we can see, TKGAT significantly improves over the second-best model, TeRo, on most metrics. The typical TKG representation learning models DE-SimplE, ATiSE, and TeRo pay more attention to modeling temporal information while ignoring the topological structure of the TKG. In contrast, our model is based on the GNN framework, which has the advantage of building structural features. Besides, our model adopts attention networks to model relation weights and applies decoupled attention to incorporate more extensive TKG structural features, which allows it to describe the characteristics of entities and relations accurately. TKGAT obtains central entity features by aggregating temporal neighbors; the larger number of network parameters used to learn these features slightly increases model complexity but improves accuracy. Meanwhile, the time encoding function based on Bochner's theorem models relative time features, which further improves performance.

The experimental results also show that the improvement on ICEWS05-15 and GDELT is greater than that on the ICEWS14 dataset. The main reason is the comparatively small scale of ICEWS14: a large amount of training data is required to achieve the best prediction results. In addition, the results show that the model performs better on the ICEWS14 and ICEWS05-15 datasets than on GDELT. The major reason is that GDELT has a rather small number of entities and relation types while the interactions between entities are extremely complex, which makes it challenging to extract effective information. Furthermore, the quality of the GDELT dataset is slightly lower, resulting in relatively lower accuracy.

Fig. 3. Ablation study on three datasets

5.3 Ablation Study

To verify the effectiveness of each component of TKGAT, we first implemented a version of TKGAT with all temporal attention weights set to the same value (-Time) to verify the validity of the time encoding function based on Bochner's theorem. Second, we removed the decoupled attention module (-Decoupled) and directly adopted the traditional self-attention mechanism to calculate attention scores between entities. Finally, we incorporated relation information directly into the object vectors with a linear transformation (-Linear) to verify the effectiveness of modeling relation weights.

As shown in Fig. 3, the TKGAT-Time model shows a significant drop in MRR on all datasets, which proves the effectiveness of the time encoding function and indicates that building temporal features in a TKG is essential. In addition, the TKGAT-Decoupled model performs worse than the full TKGAT model, which shows that the decoupled attention method benefits the attention mechanism and that the correlations between entity and temporal features it captures are effective for TKG representation learning. We can also observe that the TKGAT-Linear model performs slightly worse than TKGAT, which indicates the effectiveness of capturing relation weights.

6 Conclusion

In this paper, we present a novel model, called TKGAT, for temporal knowledge graph representation learning. Specifically, a time encoding function based on Bochner's theorem is applied to efficiently model relative time information, decoupled attention is adopted to capture the correlations between entity and temporal features, and the influences of different relations are learned by an attention network. Experimental results show that TKGAT can effectively model temporal knowledge graph features, and the ablation study demonstrates the effectiveness of each component. For future work, the generation of time-aware discriminative negative samples is worth exploring.