
1 Introduction

A great amount of the data generated in daily life takes the form of graphs, such as social networks, financial transactions, and literature citations. Researchers have adopted the triple form \(\left( subject,relation,object\right) \) to represent semantic information in data and construct large-scale knowledge graphs (KGs) such as DBpedia, FreeBase, and WordNet [25]. However, KGs are usually incomplete due to data sparsity, which makes knowledge graph completion (KGC) a priority task. Knowledge graph representation learning expresses the underlying semantic information by mapping triples into continuous low-dimensional vector spaces, which has proved to be an efficient method for KGC [11].

Static KG representation learning models ignore temporal information, which can lead to inaccurate semantic representations. As depicted in Fig. 1 (a), there are three relations Praise or endorse, Make optimistic comment, and Criticize or denounce between Barack Obama and Iran; such knowledge

Fig. 1. Example of the temporal knowledge graph

can cause confusion when temporal information is neglected, since these three relations conflict with one another. Figure 1 (b) depicts a sample of a temporal knowledge graph (TKG): the relations between Barack Obama and Iran become clear once temporal information is added. We can also observe that Iran has an Express intent to cooperate relation with China, a Consult relation with Afghanistan, and a Host a visit relation with Syria; these three relation types have different impacts on Iran, and the topology of the countries and relations around Iran also determines the character of Iran. Therefore, effectively modeling the topological features of a KG is essential for KG representation learning. Capturing temporal features in a TKG is also crucial. As shown in Fig. 1 (b), the relation Make a visit between Barack Obama and South Korea occurred at time 2014-08-16, whereas Barack Obama has a Make optimistic comment relation with Iran at time 2014-12-29; because of the long time interval between the two events, the former has less influence on the latter, which also reveals that relative time typically carries more significant temporal characteristics than absolute time. Our model aims to capture the topological and temporal features of a TKG well, in contrast to static KG representation learning models, which ignore temporal information and process the TKG in a static manner, resulting in incomplete and inaccurate expression of semantic information.

In recent years, TKG representation learning, which incorporates the corresponding temporal features when expressing the semantic information in data, has received extensive attention from both academia and industry [7]. However, most current TKG representation learning models face several challenges. (1) Sensible time encoding: since the TKG topology is dynamic, entities should have different features at different times. Moreover, the time encoding should satisfy the inherent properties of time; for instance, relative time usually carries more meaningful information than absolute time: when a user buys a product on the internet, how long the user stays on a certain product page matters more than the order in which products were browsed. Previous models mostly used simple feed-forward or recurrent neural networks to capture temporal features, which lack a solid theoretical foundation. (2) Modeling relations appropriately: distinct relations around an entity should have different influences on it, yet most existing models do not consider relation attention. Topological information is incomplete when different relations are treated with the same attention weights. (3) Effectively modeling structure: most TKG representation learning models extend static KG models, focusing on the inherent characteristics of quadruples and treating quadruples independently while ignoring structural information. A model should also capture the correlations between intrinsic entity features and temporal features when modeling structure, which remains challenging.

A TKG attention network, named TKGAT, is proposed to address these common problems in existing TKG representation learning models. A time encoding function based on Bochner's theorem [23] is adopted to capture temporal features; it is well suited to modeling the properties of relative time and has a deep theoretical foundation. The weights of different relation types are computed by an attention network to reflect their relevance to the central entity. The self-attention mechanism [19] has proved its powerful ability in various tasks; we replace position encoding with time encoding and apply decoupled attention [6] to optimize self-attention, which incorporates more extensive knowledge graph features and effectively captures the correlations between entities and time. Our contributions in this paper can be summarized as follows.

(1) We propose a novel temporal knowledge graph representation learning model, TKGAT, which encodes temporal information based on Bochner's theorem and uses attention networks to capture the weights of different relations, in order to model relational information efficiently and improve performance.

(2) We design a decoupled attention approach that separates structure and time encoding to optimize the traditional self-attention mechanism, and combine it with graph neural networks to efficiently capture the correlations between entity and temporal features.

(3) The proposed model outperforms the baseline methods on three public datasets, further demonstrating its effectiveness.

The rest of this paper is organized as follows. Section 2 presents related works. We introduce preliminaries in Sect. 3. We describe the proposed model in detail in Sect. 4. Section 5 reports the experimental results, and we conclude in Sect. 6.

2 Related Work

In this section, the traditional static KG representation learning models and the TKG representation learning models are introduced.

2.1 Static Knowledge Graph Representation Learning

At present, most existing knowledge graph representation learning models are designed for static KGs and can be classified into three categories. The first category is translation-based models, which require the head and tail entities to satisfy the translation constraint of the relation and measure the plausibility of a triple by the Euclidean distance between the translated head and tail entity vectors. TransE [1], TransH [20], and TransR [13] are the most representative models; owing to the simplicity and efficiency of TransE, a series of subsequent works have extended it. The second category is semantic matching based models, which evaluate the plausibility of a fact by matching the underlying semantic information of entities and relations in the vector space. RESCAL [15], DistMult [24], ComplEx [18], and SimplE [9] are the simplest and most widely used models. The third category is neural network-based models, which exploit the strength of neural networks in feature extraction and non-linear fitting to model KG features; representative models include ConvE [3], ConvKB [14], and RGCN [16]. However, all these models ignore temporal information and fail to reflect the changing nature of the real world, resulting in lower accuracy on TKGs.

2.2 Temporal Knowledge Graph Representation Learning

In recent years, temporal knowledge graph representation learning has gradually become a hot research topic. Most existing models primarily extend static KG representation learning to TKGs. TTransE [7] adds temporal information to the score function of TransE and makes it satisfy a time-based translation constraint. HyTE [2] extends TransH by projecting entities and relations onto time-specific hyperplanes to embed temporal information. TA-TransE [4] represents the relation type and temporal information as a character sequence, then uses an LSTM to learn time-aware representations of relation types. TComplEx [10] extends ComplEx and treats the score of each quadruple as a fourth-order tensor decomposition. TeRo [21] borrows ideas from TransE and RotatE [17]; it defines the temporal evolution of entity embeddings as a rotation and regards relations as translations. ATiSE [22] incorporates temporal information into entity and relation representations using additive time series decomposition and represents temporal uncertainty with a multi-dimensional Gaussian distribution. Inspired by diachronic word embedding, DE-SimplE [5] incorporates temporal information into diachronic entity embeddings and can model various relation patterns. Compared to our model, these models fail to capture the rich structural information and the correlations between entity and temporal features. Another line of work on TKG representation learning employs neural networks: RE-NET [8] adopts an R-GCN based aggregator and a recurrent event encoder to model historical information, and RE-GCN [12] learns evolutional representations of entities and relations by capturing structural dependencies and sequential patterns. However, these models focus on the TKGC extrapolation task, i.e., inferring future facts in a sequence, which is fundamentally different from our work.

3 Preliminaries

In this section, we present the preliminaries of our work, including the definition of temporal knowledge graph and graph neural network.

3.1 Temporal Knowledge Graph

In this paper, we represent a temporal knowledge graph as \(\mathcal {G}=\left\{ \left( s , r , o , t \right) \right\} \subseteq \mathcal {V} \times \mathcal {R} \times \mathcal {V} \times \mathcal {T}\), where \(\mathcal {V}\), \(\mathcal {R}\) and \(\mathcal {T}\) denote the sets of nodes, edges, and timestamps, respectively. Temporal knowledge graph completion (TKGC) addresses the incompleteness of TKGs. Assume the set of all true facts is \(\mathcal {F}\subseteq \mathcal {V}\times \mathcal {R}\times \mathcal {V}\times \mathcal {T}\); a TKG is a subset of this set because of its incompleteness, i.e., \(\mathcal {G}\subseteq \mathcal {F}\). TKGC is the reasoning from \(\mathcal {G}\) to \(\mathcal {F}\). According to the time range, TKGC has two settings, interpolation and extrapolation. Given a temporal knowledge graph \(\mathcal {G}\) with timestamps t ranging from \(t_{1}\) to \(t_{T}\), the interpolation setting predicts missing facts with \(t_{1}< t <t_{T}\); in contrast, the extrapolation setting predicts missing facts with \(t >t_{T}\), i.e., predicting future facts based on past ones. More formally, the purpose of TKGC is to predict either the subject in a given query \(\left( ?,r,o,t \right) \) or the object in a given query \(\left( s,r,?,t \right) \). Our work focuses on TKGC in the interpolation setting.

3.2 Graph Neural Network

Graph neural networks (GNNs) enjoy several advantages, such as the ability to effectively handle non-Euclidean data, which has made them very successful in processing graph data. The core idea of a GNN is the message propagation mechanism, i.e., the features of a central node are constructed by aggregating information from its neighbors. To obtain the features of a central node i through multiple GNN layers, each layer performs the following two steps: (1) Message Propagation, which gets messages from all neighbors of node i; (2) Message Aggregation, which aggregates the messages from all neighbor nodes and then combines them with the features of node i from the previous layer to obtain the features at the current layer. The above processes are defined as follows:

$$\begin{aligned} \textbf{h}^{l}_{\mathcal {N}_{i}^{k} }\leftarrow AGG \left( \left\{ \textbf{h}^{l-1}_{j},\forall j\in \mathcal {N}_{i}^{k} \right\} \right) \end{aligned}$$
(1)
$$\begin{aligned} \textbf{h}^{l}_{i}\leftarrow \sigma \left( \textbf{W}^{l}\left( \textbf{h}^{l-1}_{i}\left| \right| \textbf{h}^{l}_{\mathcal {N}_{i}^{k} } \right) \right) \end{aligned}$$
(2)

Steps (1) and (2) correspond to Eqs. 1 and 2, respectively, where \(\mathcal {N}_{i}^{k}\) denotes the k neighbors of node i, \(\textbf{h}_{i}^{l}\) denotes the hidden state of node i at the l-th layer, and AGG is a function that aggregates the features of neighbors, which can be implemented using long short-term memory (LSTM), self-attention mechanisms, etc. In this paper, we use a decoupled attention approach to implement AGG, which is able to capture more extensive features. Representative GNN models include graph convolutional networks (GCN) and graph attention networks (GAT), both of which assign weights to neighbors explicitly or implicitly when aggregating features.
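As a minimal illustration of Eqs. 1 and 2, the following PyTorch-style sketch implements one generic GNN layer with a simple mean aggregator; the class and parameter names are illustrative, and the mean aggregator is only a placeholder for the decoupled attention used later in this paper.

```python
import torch
import torch.nn as nn

class SimpleGNNLayer(nn.Module):
    """Illustrative GNN layer in the sense of Eqs. (1)-(2)."""

    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(2 * dim, dim)  # W^l applied to [h_i^{l-1} || h_N]

    def forward(self, h_i, h_neighbors):
        # h_i: (dim,) previous-layer state of node i
        # h_neighbors: (k, dim) previous-layer states of its k neighbors
        h_agg = h_neighbors.mean(dim=0)                # AGG: a mean aggregator here
        out = self.W(torch.cat([h_i, h_agg], dim=-1))  # combine with h_i^{l-1}
        return torch.relu(out)                         # sigma
```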

Fig. 2. The architecture of the TKGAT model. In this figure, we evaluate the truth of the quadruple (Barack Obama, Make optimistic comment, Iran, 2014-12-29). First, we find the temporal neighbors whose interaction time with Barack Obama is before 2014-12-29; the relation encoding module combines the vectors of the subject Barack Obama, the relations, and the temporal neighbors to calculate attention weights and integrates the relation features into the temporal neighbors. Second, the time encoding function based on Bochner's Theorem is applied to capture relative time features. Third, the decoupled attention module learns the vector of Barack Obama by capturing structural and temporal features; an analogous approach is used for Iran. Finally, the static KG embedding model ConvKB is adopted to score the triple with integrated temporal features.

4 Our Approach

Figure 2 depicts the architecture of our model. Overall, the model follows an encoder-decoder architecture. The encoder module maps entities into a continuous low-dimensional vector space and incorporates structural and temporal features simultaneously. Since relations are usually independent of temporal information, the temporal features are integrated into the entity vectors in our model. Because different relation types have different impacts on the subject, the encoder module first integrates the relation features into the objects according to type attention weights, then employs a decoupled attention method to learn the interactions between subjects and objects in terms of structure and time. Finally, the quadruple \(\left( s,r,o,t \right) \) is converted into the triple \(\left( s_{t},r,o_{t} \right) \), so the decoder module can directly evaluate triples using static KG embedding methods.

4.1 Encoded Relation Information

Assume that there are \(\left| \mathcal {R} \right| \) relation types and \(\left| \mathcal {V} \right| \) entities in the temporal knowledge graph \(\mathcal {G}\). The initial vectors of all entities and relations are represented as sets \(\textrm{E} =\left\{ \textbf{e}_{i} \right\} _{i=1}^{\left| \mathcal {V} \right| }\) and \(\textrm{R} =\left\{ \textbf{r}_{i} \right\} _{i=1}^{\left| \mathcal {R} \right| }\) respectively, where \(\textbf{e}_{i} \in \mathbb {R}^{d_{e}}\) is the initial vector of the i-th entity and \(\textbf{r}_{i} \in \mathbb {R}^{d_{r}}\) is the initial vector of the i-th relation; \(d_{e}\) and \(d_{r}\) denote the initial vector dimensions of entities and relations, respectively. Given a quadruple \(\left( s , r , o ,t \right) \), according to the inherent characteristics of time, i.e., information about future events cannot influence the present moment, the temporal neighbors of subject \( s \) are denoted as \(\mathcal {N}_{s}^{t_{k}<t}=\left\{ (r_{i},o_{j},t_{k})|\left( s,r_{i},o_{j},t_{k} \right) \in \mathcal {G},t_{k}< t \right\} \). Since different relation types have different effects on the subject, we combine the subject vector \(\textbf{e}_{{s}}\), relation vector \(\textbf{r}_{{i}}\), and object vector \(\textbf{e}_{{j}}\) and calculate the attention weights with the \(\textrm{softmax}\) function. Finally, the relation feature is incorporated into the corresponding object vector, where the attention weights are calculated as follows.

$$\begin{aligned} \textbf{u}_{r_{i},o_{j}} =\textbf{W}_{1} \left( \textbf{e}_{{s}}\left| \right| \textbf{r}_{{i}}\left| \right| \textbf{e}_{{j}} \right) \end{aligned}$$
(3)
$$\begin{aligned} \alpha _{i,j}=\textrm{softmax} \left( \textbf{u}_{r_{i},o_{j}}\right) = \frac{\textrm{exp} \left( \sigma \left( \textbf{p} \cdot \textbf{u}_{r_{i},o_{j}} \right) \right) }{ {\textstyle \sum _{(r_{m},o_{n},t_{k})\in \mathcal {N}_{s}^{t_{k}<t}}^{}} \textrm{exp} \left( \sigma \left( \textbf{p}\cdot \textbf{u}_{r_{m},o_{n}} \right) \right) } \end{aligned}$$
(4)

where \(\textbf{W}_{1}\in \mathbb {R}^{d_{e}\times (2d_{e}+d_{r})}\) and \(\textbf{p}\in \mathbb {R}^{d_{e}}\) are parameters learned during training, and \(\sigma \) is the \(\textrm{LeakyReLU}\) activation function. After obtaining the attention weights \(\alpha _{i,j}\) of the relation types, the temporal neighbor vectors that incorporate relation features are calculated as follows:

$$\begin{aligned} \textbf{x}_{i,j}=\alpha _{i,j}\textbf{W}_{2}\left( \textbf{r}_{{i}}\left| \right| \textbf{e}_{{j}} \right) \end{aligned}$$
(5)

where \(\textbf{W}_{2}\in \mathbb {R}^{d_{e}\times (d_{e}+d_{r})}\) is a model parameter matrix.
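The relation encoding of Eqs. 3-5 can be sketched in PyTorch as follows; this is a minimal sketch under assumed tensor shapes, and the class name and batching convention are illustrative rather than a definitive implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationAttention(nn.Module):
    """Sketch of Eqs. (3)-(5): weight each temporal neighbor (r_i, o_j) by its
    relevance to the subject s and fold the relation feature into the object."""

    def __init__(self, d_e, d_r):
        super().__init__()
        self.W1 = nn.Linear(2 * d_e + d_r, d_e, bias=False)  # W_1
        self.W2 = nn.Linear(d_e + d_r, d_e, bias=False)      # W_2
        self.p = nn.Parameter(torch.randn(d_e))               # attention vector p

    def forward(self, e_s, r, e_o):
        # e_s: (d_e,) subject; r: (k, d_r) relations; e_o: (k, d_e) objects
        k = r.size(0)
        u = self.W1(torch.cat([e_s.expand(k, -1), r, e_o], dim=-1))    # Eq. (3)
        alpha = F.softmax(F.leaky_relu(u @ self.p), dim=0)             # Eq. (4)
        x = alpha.unsqueeze(-1) * self.W2(torch.cat([r, e_o], dim=-1)) # Eq. (5)
        return x  # (k, d_e): neighbor vectors with relation features folded in
```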

4.2 Encoded Temporal Information

Having obtained the entity vectors that incorporate relation information, our aim is to further integrate temporal information. Since the structure of a TKG is no longer static and entity features may change over time, the time encoding should reflect temporal characteristics, e.g., events that happened long ago have less impact on current events. We employ the time encoding function proposed in [23], which maps from the time domain to a continuously differentiable functional domain; it is based on Bochner's Theorem and is compatible with gradient descent during model training. We denote it as \(\varPhi (t)\) and define it as follows:

$$\begin{aligned} t\rightarrow \varPhi (t):= \sqrt{\frac{1}{d_{t}}}\left[ \cos (\omega _{1}t),\sin (\omega _{1}t),\ldots ,\cos (\omega _{d_{t}}t),\sin (\omega _{d_{t}}t) \right] \end{aligned}$$
(6)

where \(\omega =\left[ \omega _{1},...,\omega _{d_{t}}\right] ^{ T }\) are learnable parameters.
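A possible implementation of this time encoding is sketched below; the cos and sin components are simply concatenated rather than interleaved, and the random initialization of the frequencies is an assumption.

```python
import math
import torch
import torch.nn as nn

class TimeEncoder(nn.Module):
    """Sketch of the Bochner-based time encoding of Eq. (6)."""

    def __init__(self, d_t):
        super().__init__()
        self.omega = nn.Parameter(torch.randn(d_t))  # learnable frequencies

    def forward(self, delta_t):
        # delta_t: (k,) relative times t - t_k of the temporal neighbors
        angles = delta_t.unsqueeze(-1) * self.omega                       # (k, d_t)
        phi = torch.cat([torch.cos(angles), torch.sin(angles)], dim=-1)   # (k, 2*d_t)
        return phi / math.sqrt(self.omega.numel())                        # sqrt(1/d_t) scaling
```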

4.3 Encoded Structural Information

Since the topology of the TKG contains important information, we borrow the core idea of GNNs, i.e., the message propagation mechanism, to capture structural information. To aggregate messages from neighbors with attention weights, we adopt a decoupled attention method based on the self-attention mechanism.

Given a quadruple \(\left( s , r , o ,t \right) \), the temporal neighbors of subject \( s \) are \(\mathcal {N}_{s}^{t_{k}<t}\). At time t, the vector of subject \( s \) at the l-th layer is represented as \(\textbf{h}^{l}\); when \(l=1\), \(\textbf{h}^{l}=\textbf{e}_{s}\), i.e., the initial vector of \( s \). The object connected to subject s through relation \( r _{j}\) is \( o _{i}\), whose vector at the l-th layer is \(\textbf{h}^{l}_{i}\); when \(l=1\), \(\textbf{h}^{l}_{i}=\textbf{x}_{i,j}\), which is obtained by the relation encoding module. Since relative time, rather than absolute time, usually reveals the critical temporal information, we directly encode the relative times \(\left\{ t-t_{1},t-t_{2},...,t-t_{k} \right\} \) with the time encoding function, obtaining the temporal encodings of the neighbors \(\left\{ \varPhi (t-t_{1}),\varPhi (t-t_{2}),...,\varPhi (t-t_{k}) \right\} \), where k denotes the number of neighbors of s at time t.

The traditional self-attention mechanism is designed for sequential structures: it adds or combines the two vectors representing the content and position of a token to construct its feature. However, this approach cannot effectively capture the correlation between content and position features. Inspired by DeBERTa [6], we replace position encoding with time encoding and calculate the weights by a decoupled attention method.

The query vector at layer l is \(\textbf{q}=\textbf{W}_{q}\textbf{h}^{l-1}\), where \(\textbf{W}_{q}\in \mathbb {R}^{d_{h}\times d_{e}}\) is a model parameter matrix. The vectors of the temporal neighbors and their temporal encodings are stacked into matrices \(\textbf{Z}_{E}\) and \(\textbf{Z}_{T}\) respectively, which at the \(l-1\) layer are:

$$\begin{aligned} \textbf{Z}_{E}=\left[ \textbf{h}_{1}^{(l-1)},\textbf{h}_{2}^{(l-1)},...,\textbf{h}_{k}^{(l-1)} \right] \in \mathbb {R}^{d_{e} \times k } \end{aligned}$$
(7)
$$\begin{aligned} \textbf{Z}_{T}=\left[ \varPhi (t-t_{1}),\varPhi ( t-t_{2}),...,\varPhi (t-t_{k}) \right] \in \mathbb {R}^{d_{t} \times k} \end{aligned}$$
(8)

Applying linear transformation on matrices \(\textbf{Z}_{E}\) and \(\textbf{Z}_{T}\):

$$\begin{aligned} \textbf{K}=\textbf{W}_{K}\textbf{Z}_{E}, \textbf{P}=\textbf{W}_{T}\textbf{Z}_{T}, \textbf{V}=\textbf{W}_{V}\textbf{Z}_{E} \end{aligned}$$
(9)

where \(\textbf{W}_{K},\textbf{W}_{V} \in \mathbb {R}^{d_{h}\times d_{e}}\) and \(\textbf{W}_{T} \in \mathbb {R}^{d_{h} \times d_{t}}\) are model parameters. The attention matrix obtained by the decoupled attention approach is as follows:

$$\begin{aligned} \widetilde{\textbf{A}}_{0,j}= \left[ \textbf{q} \right] ^{\textsf{T}}\textbf{K}_{j} + \left[ \textbf{q} \right] ^{\textsf{T}}\textbf{P}_{j} \end{aligned}$$
(10)

where the attention matrix \(\widetilde{\textbf{A}}\in \mathbb {R}^{1\times k}\), and \(\textbf{K}_{j}\) and \(\textbf{P}_{j}\) denote the j-th columns of \(\textbf{K}\) and \(\textbf{P}\), respectively. In the attention computation, \(\left[ \textbf{q} \right] ^{\textsf{T}}\textbf{K}_{j}\) captures the correlation between subject \( s \) and the j-th neighbor object in terms of structure, and \(\left[ \textbf{q} \right] ^{\textsf{T}}\textbf{P}_{j}\) captures their correlation in terms of time; the final attention matrix is obtained by adding the two. We apply the \(\textrm{softmax}\) function to obtain the weights, and the final feature vector of the temporal neighbors is then obtained as a weighted sum.

$$\begin{aligned} \textbf{h}^{l}_{\mathcal {N}_{s}^{< t}}=\textrm{softmax}\left( \frac{\widetilde{\textbf{A}}_{0,j}}{\sqrt{2d_{h}} } \right) \textbf{V} \end{aligned}$$
(11)
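Putting Eqs. 7-11 together, the decoupled attention over the temporal neighbors of one subject can be sketched as follows; the single-head, single-subject formulation and the parameter names are illustrative simplifications, not a definitive implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledAttention(nn.Module):
    """Sketch of Eqs. (7)-(11): structural (q.K_j) and temporal (q.P_j)
    correlations are scored separately and summed before the softmax."""

    def __init__(self, d_e, d_t, d_h):
        super().__init__()
        self.W_q = nn.Linear(d_e, d_h, bias=False)
        self.W_K = nn.Linear(d_e, d_h, bias=False)
        self.W_V = nn.Linear(d_e, d_h, bias=False)
        self.W_T = nn.Linear(d_t, d_h, bias=False)
        self.d_h = d_h

    def forward(self, h_s, Z_E, Z_T):
        # h_s: (d_e,) subject state h^{l-1}
        # Z_E: (k, d_e) neighbor states; Z_T: (k, d_t) relative-time encodings
        q = self.W_q(h_s)                                       # (d_h,)
        K, V, P = self.W_K(Z_E), self.W_V(Z_E), self.W_T(Z_T)   # each (k, d_h)
        scores = K @ q + P @ q                                  # Eq. (10), shape (k,)
        alpha = F.softmax(scores / math.sqrt(2 * self.d_h), dim=0)  # Eq. (11)
        return alpha @ V                                        # (d_h,) weighted sum
```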

To maintain the original features of the subject s, we concatenate the final feature vector of the temporal neighbors with the hidden vector of s at the \((l-1)\)-th layer, then pass the result through a multilayer perceptron to capture non-linear interactions.

$$\begin{aligned} \textbf{h}^{l}=\textrm{MLP}\left( \textbf{h}^{l}_{\mathcal {N}_{s}^{< t}}\left| \right| \textbf{h}^{l-1}\right) =\textrm{ReLU}\left( \left[ \textbf{h}^{l}_{\mathcal {N}_{s}^{< t}}\left| \right| \textbf{h}^{l-1} \right] \textbf{W}^{l}_{0}+\textbf{b}^{l}_{0} \right) \textbf{W}^{l}_{1}+\textbf{b}^{l}_{1} \end{aligned}$$
(12)
$$\begin{aligned} \textbf{W}^{l}_{0}\in \mathbb {R}^{2d_{h}\times d_{h}}, \textbf{b}^{l}_{0} \in \mathbb {R}^{d_{h}}, \textbf{W}^{l}_{1}\in \mathbb {R}^{d_{h} \times d_{o}}, \textbf{b}^{l}_{1} \in \mathbb {R}^{d_{o}} \end{aligned}$$

where \(\textbf{W}^{l}_{0}\), \(\textbf{b}^{l}_{0}\), \(\textbf{W}^{l}_{1}\) and \(\textbf{b}^{l}_{1}\) are model parameters and \(d_{o}\) denotes the dimension of the final output vector. The proposed model can also be easily extended to a multi-head setting, which improves performance and stability. Suppose there are m different \(\textrm{head}\)s with \(\textrm{head}^{\left( i\right) }=\textbf{h}^{l\left( i\right) }_{\mathcal {N}_{s}^{t_{k}<t}}\); we concatenate the m \(\textrm{head}\) outputs with the vector of s and carry out the same procedure as in Eq. 12.

$$\begin{aligned} \widetilde{\textbf{h}}^{l}=\textrm{MLP}\left( \textrm{head}^{\left( 1 \right) }\left| \right| \cdots \left| \right| \textrm{head}^{\left( m \right) }\left| \right| \textbf{h}^{l-1}\right) \end{aligned}$$
(13)
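The multi-head combination of Eqs. 12 and 13 then amounts to concatenating the head outputs with the previous-layer state of s and passing the result through a two-layer MLP; the sketch below uses assumed, illustrative dimension names.

```python
import torch
import torch.nn as nn

class CombineHeads(nn.Module):
    """Sketch of Eqs. (12)-(13): [head^(1) || ... || head^(m) || h^{l-1}] -> MLP."""

    def __init__(self, m, d_h, d_in, d_o):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(m * d_h + d_in, d_h),  # W_0^l, b_0^l
            nn.ReLU(),
            nn.Linear(d_h, d_o),             # W_1^l, b_1^l
        )

    def forward(self, heads, h_prev):
        # heads: list of m tensors of shape (d_h,); h_prev: (d_in,) state h^{l-1}
        return self.mlp(torch.cat(heads + [h_prev], dim=-1))
```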

4.4 Decoder and Training

Given a quadruple \(\eta =\left( s , r , o ,t \right) \), the encoder module of TKGAT provides vectors with temporal information \(\left( \tilde{\textbf{s}_{t}},\textbf{r},\tilde{\textbf{o}_{t}} \right) \). Since the temporal information has been incorporated into the entity vectors, a static KG score function can be used to evaluate the triples. Among existing methods, TKGAT adopts ConvKB as the decoder, whose score function is defined as follows:

$$\begin{aligned} f \left( \eta \right) =\left( \mathop {||}\limits _{n=1}^{\left| \varOmega \right| }g\left( \left[ \textbf{s}_{t},\textbf{r},\textbf{ o}_{t} \right] *\omega ^{n} \right) \right) \textbf{W}_{c} \end{aligned}$$
(14)

where \(\varOmega \) denotes the set of convolution kernels, \(\omega ^{n}\in \varOmega \) denotes the n-th convolution kernel, and \(\textbf{W}_{c}\) denotes the parameter matrix of the linear transformation; \(\varOmega \) and \(\textbf{W}_{c}\) are shared across all triples during training. The activation function \(g(\cdot )\) is \(\textrm{ReLU}\) and \(*\) denotes the convolution operation. The output vectors of the \(\left| \varOmega \right| \) convolution operations are concatenated into a single vector, and a linear transformation is applied to obtain the final score.
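A ConvKB-style decoder in the spirit of Eq. 14 can be sketched as follows; the batching, kernel shape, and module names are assumptions rather than a faithful reproduction of the original ConvKB code.

```python
import torch
import torch.nn as nn

class ConvKBDecoder(nn.Module):
    """Sketch of the ConvKB-style score of Eq. (14): 1x3 filters slide over the
    stacked [s_t, r, o_t] embeddings; the concatenated feature maps are
    projected to a scalar score."""

    def __init__(self, d, num_filters):
        super().__init__()
        self.conv = nn.Conv2d(1, num_filters, kernel_size=(1, 3))  # Omega
        self.W_c = nn.Linear(num_filters * d, 1, bias=False)       # W_c

    def forward(self, s_t, r, o_t):
        # s_t, r, o_t: (batch, d) embeddings with temporal features folded in
        x = torch.stack([s_t, r, o_t], dim=2).unsqueeze(1)  # (batch, 1, d, 3)
        feat = torch.relu(self.conv(x))                     # (batch, |Omega|, d, 1)
        return self.W_c(feat.flatten(1))                    # (batch, 1) scores
```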

During model training, the parameters are learned using gradient-based optimization in mini-batches. For each quadruple \(\eta =\left( s,r,o,t\right) \in \mathcal {G} \), we sample a set of negative entities \(S=\left\{ o'|(s,r,o',t) \not \in \mathcal {G}\right\} \), and the cross-entropy loss function is used to train the model, defined as follows:

$$\begin{aligned} \mathcal {L}=-\sum _{\eta \in \mathcal {G}}\log \frac{\textrm{exp}\left( f (s,r,o,t)\right) }{\textrm{exp}\left( f (s,r,o,t)\right) + {\textstyle \sum _{o'\in S}}\textrm{exp}\left( f (s,r,o',t)\right) } \end{aligned}$$
(15)

Note that, without loss of generality, the above loss and negative samples are given for one query direction; the other direction is handled analogously. Algorithm 1 shows the training process in detail.
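In code, the objective of Eq. 15 reduces to a standard cross-entropy over the true entity and its sampled negatives, roughly as in the sketch below (tensor shapes and names are assumptions):

```python
import torch
import torch.nn.functional as F

def quadruple_loss(score_pos, score_neg):
    """Sketch of Eq. (15): the true entity is the correct class among itself
    and the sampled negatives."""
    # score_pos: (batch, 1) scores f(s, r, o, t) of true quadruples
    # score_neg: (batch, n_neg) scores f(s, r, o', t) of corrupted quadruples
    logits = torch.cat([score_pos, score_neg], dim=1)  # (batch, 1 + n_neg)
    target = torch.zeros(logits.size(0), dtype=torch.long,
                         device=logits.device)         # true entity at index 0
    return F.cross_entropy(logits, target)
```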

Algorithm 1. TKGAT training algorithm

5 Experiments

In this section, to verify the effectiveness of the proposed model, we conduct experiments on link prediction tasks on three public datasets. We first introduce the experimental setup, including datasets, evaluation metrics, baselines, and implementation, and then analyze the experimental results. Furthermore, we perform several ablation studies to demonstrate the effectiveness of each main component of the proposed model.

5.1 Experimental Setup

Datasets. We evaluate the proposed model on link prediction tasks using three public TKG datasets; their statistics are summarized in Table 1. For the Integrated Crisis Early Warning System (ICEWS) dataset, we use two subsets provided by [4]: ICEWS14, corresponding to facts in 2014, and ICEWS05-15, corresponding to facts between 2005 and 2015. For the Global Database of Events, Language, and Tone (GDELT) dataset, we use a subset corresponding to facts from 1 April 2015 to 31 March 2016; each fact has a corresponding timestamp. We use the same training, validation, and testing splits as provided by [5].

Evaluation Metrics. For each quadruple \(\left( s,r,o,t\right) \in \mathcal {D}_{test}\), where \(\mathcal {D}_{test}\) represents the test dataset, we generate two queries: \(\left( s,r,?,t\right) \) and \(\left( ?,r,o,t\right) \). For the first query, the model evaluates all entities and obtains scores \( f (s,r,o',t)\), \( \forall o'\in \mathcal {E}\), with an analogous approach used for the second query. According to the final scores, the rank of the given quadruple is obtained, and we report mean reciprocal rank \(\left( MRR\right) \) which is defined as:

$$\begin{aligned} MRR=\frac{1}{2\left| \mathcal {D}_{test} \right| } \sum _{\eta \in \mathcal {D}_{test}}\left( \frac{1}{rank\left( o|s,r,t \right) }+\frac{1}{rank\left( s|r,o,t \right) } \right) \end{aligned}$$
(16)

where \(\eta =\left( s,r,o,t \right) \) and \(\left| \mathcal {D}_{test} \right| \) denotes the size of the test set. We also report Hits@1, Hits@3, and Hits@10, where Hits@k is the percentage of correct quadruples ranked within the k highest-scoring predictions; Hits@k is defined as:

$$\begin{aligned} Hits@k=\frac{1}{2\left| \mathcal {D}_{test} \right| } \sum _{\eta \in \mathcal {D}_{test}}\left( \mathbb {I}_{\left( rank\left( o|s,r,t \right) \le k \right) }+\mathbb {I}_{\left( rank\left( s|r,o,t \right) \le k \right) }\right) \end{aligned}$$
(17)

where \(\mathbb {I}_{\left( \cdot \right) }\) is an indicator function, \(\mathbb {I}_{\left( cond\right) }\) is 1 if cond holds and 0 otherwise.
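These metrics can be computed directly from the ranked scores, as in the following sketch; the caller is assumed to average the returned values over both query directions, and the tensor names are illustrative.

```python
import torch

def mrr_hits(scores, true_idx, ks=(1, 3, 10)):
    """Sketch of Eqs. (16)-(17) for one query direction."""
    # scores: (num_queries, |V|) scores of every candidate entity
    # true_idx: (num_queries,) index of the correct entity for each query
    true_scores = scores.gather(1, true_idx.unsqueeze(1))    # (num_queries, 1)
    ranks = (scores > true_scores).sum(dim=1).float() + 1.0  # 1-based ranks
    metrics = {"MRR": (1.0 / ranks).mean().item()}
    for k in ks:
        metrics[f"Hits@{k}"] = (ranks <= k).float().mean().item()
    return metrics
```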

Table 1. Statistics of datasets.

Baselines. We test the performance of the proposed model against a variety of strong baselines, including static KG representation learning models and TKG representation learning models. All static models are applied without considering the time information in the input; they include TransE [1], DistMult [24], ComplEx [18], and SimplE [9]. The TKG representation learning baselines include TTransE [7], HyTE [2], TA-TransE [4], DE-SimplE [5], ATiSE [22], and TeRo [21]. As TGAT [23] is specifically designed to handle dynamic network graphs rather than TKGs, we do not compare with it.

Table 2. Evaluation results on link prediction. The best results are in bold and the second-best results are underlined.

Implementation. We implemented our model and the baselines in PyTorch and conducted the experiments on an NVIDIA Tesla V100 GPU. The vector dimensions of entities, relations, and time are fixed to 128. We experimented with different score functions for training and finally chose ConvKB as the decoder. The number of sampled temporal neighbors is set to 20 for the ICEWS14 and ICEWS05-15 datasets and 50 for the GDELT dataset. Although information from multi-hop neighbors can in principle be aggregated by our model, only information from 2-hop neighbors is aggregated to speed up training. The number of attention heads and negative samples is set to 4 and 200, respectively, and the Adam optimizer is applied to train the model with a learning rate of 0.001 for all datasets.

5.2 Results and Analysis

Table 2 shows the experimental results of link prediction on the ICEWS14, ICEWS05-15, and GDELT datasets. From the results, we can observe that the static KG representation learning models fall behind the TKG models in most cases. The primary reason is that static KG models learn only one representation for each entity or relation, without taking temporal information into account.

The results also demonstrate the state-of-the-art performance of our approach on link prediction tasks. As we can see, TKGAT significantly improves over the second-best model, TeRo, on most metrics. The typical TKG representation learning models DE-SimplE, ATiSE, and TeRo pay more attention to modeling temporal information while ignoring the topological structure of the TKG. In contrast, our model is based on the GNN framework, which has the advantage of building structural features. Besides, our model adopts attention networks to model relation weights and applies decoupled attention to incorporate more extensive TKG structural features, which allows it to describe the characteristics of entities and relations accurately. TKGAT obtains central entity features by aggregating temporal neighbors; the larger number of network parameters used to learn these features slightly increases model complexity but improves accuracy. Meanwhile, the time encoding function based on Bochner's theorem models relative time features, which further improves performance.

The experimental results also show that the improvement on ICEWS05-15 and GDELT is greater than that on the ICEWS14 dataset. The main reason is the comparatively small scale of ICEWS14: a large amount of training data is required to achieve the best prediction results. In addition, the results show that the model performs better on the ICEWS14 and ICEWS05-15 datasets than on GDELT. The major reason is that GDELT has a rather small number of entities and relation types while the interactions between entities are extremely complex, which makes it challenging to extract effective information. Furthermore, the quality of the GDELT dataset is slightly lower, resulting in relatively lower accuracy.

Fig. 3. Ablation study on three datasets

5.3 Ablation Study

To verify the effectiveness of each component of TKGAT, we first implemented a version of TKGAT with all temporal attention weights set to the same value (-Time) to verify the validity of the time encoding function based on Bochner's theorem. Second, we removed the decoupled attention module (-Decoupled) and directly adopted the traditional self-attention mechanism to calculate attention scores between entities. Finally, we incorporated relation information directly into the object vectors with a linear transformation (-Linear) to verify the effectiveness of modeling relation weights.

As shown in Fig. 3, the TKGAT-Time model shows a significant drop in MRR on all datasets, which proves the effectiveness of the time encoding function and indicates that building temporal features in a TKG is essential. In addition, the TKGAT-Decoupled model performs worse than the full TKGAT model, which shows that the decoupled attention method benefits the attention mechanism and that the correlations between entity and temporal features it captures are effective for TKG representation learning. We can also observe that the TKGAT-Linear model performs slightly worse than TKGAT, which indicates the effectiveness of capturing relation weights.

6 Conclusion

In this paper, we present a novel model, called TKGAT, for temporal knowledge graph representation learning. Specifically, a time encoding function based on Bochner's theorem is applied to efficiently model relative time information, decoupled attention is adopted to capture the correlations between entity and temporal features, and the influences of different relations are learned by an attention network. Experimental results show that TKGAT can effectively model temporal knowledge graph features, and the ablation study demonstrates the effectiveness of each component. For future work, the generation of time-aware discriminative negative samples is worth exploring.