1 Introduction

Knowledge graphs (KGs) are widely used in natural language processing applications due to their ability to represent structured knowledge. However, most KGs suffer from incompleteness, which limits the performance and scope of KG applications to a certain extent. Predicting the missing facts through KG reasoning is therefore an important task. Recently, the inference task has been extended from KGs to a more challenging setting: Temporal Knowledge Graphs (TKGs). The inference task on TKGs can be simply expressed as predicting the missing object entity in a query (subject entity, relation, ?, timestamp). Most existing work focuses on how to effectively integrate temporal information into the model. Some studies suggest that traditional KG embedding methods can be extended to TKGs [1, 2]. However, they only pay attention to learning the latent representation of an entity within a single snapshot, while ignoring the dynamic evolution of entities. Recently, some researchers have also studied the dynamic evolution of entities and relations [3,4,5,6] by combining historical information from previous snapshots. Another line of work tries to enrich the latent representations of entities by incorporating additional information [7,8,9], such as entity descriptions, event descriptions and uncertainty information.

Fig. 1. Illustration of the process by which ST-Net leverages spatial information to predict future facts.

Inspired by this, we observe that the above methods all ignore the importance of the spatial information (country, city, organization) attached to an entity. In fact, the occurrence of an event cannot be separated from its spatial context, and an interaction between two entities also implies an interaction between two spaces. Specifically, according to the ICEWS18 dataset, the probability that the entities involved in the same event are located in the same space is 41.13%, which shows the value of leveraging the spatial information contained in entities to predict future facts.

To this end, we propose an innovative TKG representation learning model with both temporal and spatial perception. Mining the spatial information attached to entities enables fine-grained modeling of entity representations and further strengthens the representational power of entity embeddings. At the same time, we adopt the two inference modes of the copy-generation network [6] to predict future facts from the historical vocabulary and the whole entity vocabulary. As shown in Fig. 1, after applying the spatial information, two nodes without any interaction, Farm Worker and Hindustan Times, can be connected through the same space, paving the way for prediction.

We evaluate our proposed method on three benchmark datasets, and the experimental results show that ST-Net outperforms the baseline models on the link prediction task. The ablation study further shows that spatial information helps the model predict future facts more accurately.

The main contributions of this work are as follows:

  1. We propose an innovative TKG representation learning model, ST-Net, which implements fine-grained modeling of entity representations by mining the spatial information attached to entities.

  2. We introduce the two inference modes of Copy-Generation Networks [6], which predict future facts from the historical vocabulary and the whole entity vocabulary respectively.

  3. We conduct extensive experiments on three public TKG datasets and demonstrate the effectiveness of ST-Net in link prediction.

2 Related Work

We discuss three relevant research directions. Since there is a large body of work in each direction, we only select representative and closely related studies to elaborate on.

2.1 Static Knowledge Graph Embeddings

Without considering temporal facts, researchers have made considerable progress in KG embedding, as summarized by Ji et al. [10]. One classic family is translation models (TransE and its variants) [11,12,13], which use a distance-based scoring function to measure the plausibility of facts. Another is semantic matching models (such as ComplEx [14] and DistMult [15]), which use a similarity-based scoring function and measure the plausibility of facts through semantic matching. Recently, methods based on deep neural networks (such as R-GCN [16], ConvE [17] and RSN [18]) have emerged, which utilize CNNs, RNNs and graph neural networks (GCN, R-GCN, etc.) to learn embeddings of entities and relations. However, these methods cannot capture temporal facts.

2.2 Temporal Knowledge Graph Embeddings

TKG embedding incorporates the timestamps of facts into the learning process. Some researchers attempt to extend static KG methods directly to TKGs. TTransE [1] is an extension of TransE that simply integrates time information into the scoring function; HyTE [2] replaces the unit normal vector of the hyperplane projection in TransH with a time-specific normal vector. Other researchers focus on the dynamic evolution of entities and relations. Know-Evolve [3] models the occurrence of facts as a temporal point process to learn non-linearly evolving entity representations over time; Goel et al. [4] provide the embedding of an entity at any point in time by equipping a static model with a diachronic entity embedding function. However, none of these methods correlate snapshots that have occurred in history.

To address the inability to capture long-term dependencies between facts, autoregressive models have been proposed. Jin et al. [5] model TKGs autoregressively, i.e., the snapshot at timestamp T depends on the historical snapshots before T; Han et al. [19] leverage continuous temporal embeddings to encode the temporal and structural information of historical snapshots; Zhu et al. [6] exploit the recurrence of facts and combine two inference modes to predict future facts from the historical vocabulary and the whole entity vocabulary respectively.

2.3 Embedding with Auxiliary Information

More recent attempts combine multimodal information with the structured information in KGs to promote more effective knowledge representation. SSP [8] provides more precise semantic embeddings for entities and relations by projecting triples and text descriptions into semantic subspaces. Glean [7] leverages graph neural networks to fuse unstructured event descriptions with structured data from TKGs to enrich the hidden features of event participants. The uncertain KG embedding model proposed by Chen et al. [9] incorporates the confidence scores of uncertain relational facts when learning embeddings. Inspired by this, we fuse the spatial information contained in entities with the structured information of TKGs. To the best of our knowledge, this is the first work to incorporate spatial information into TKG representation learning.

3 Spatial-Temporal Network

In this section, we start with the notations used to build our model and the problem definition, and then introduce the model architecture as well as its training and inference procedures in detail. Figure 2 illustrates an overview of the ST-Net reasoning process. ST-Net can be decomposed into three sub-modules: Query Vectorization, Capture Historical Vocabulary, and Copy-Generation Mode.

3.1 Notations

A TKG consists of temporal facts, each of which can be described as a subject entity \(s \in \mathcal {E}\) and an object entity \(o \in \mathcal {E}\) that have a relation \(r \in \mathcal {R}\) at timestamp \(t \in \mathcal {T}\), denoted as a quadruple (s,r,o,t), where \(\mathcal {E}\) and \(\mathcal {R}\) are the entity and relation vocabularies respectively, and \(\mathcal {T}\) is the set of timestamps. The boldface s, r, o, t denote the corresponding embedding vectors. A TKG can be divided into a set of snapshots \(\{G_1,G_2,\cdots ,G_{t_k}\}\) according to the timestamps, where \(G_t\) is the snapshot of the TKG at time step t, containing all temporal facts at that time step. For each subject entity and relation pair at timestamp \(t_k\), we define a delimited subset of \(\mathcal {E}\) specific to \((s,r,t_k)\), denoted \({\textbf {H}}_{t_k}^{(s,r)}\) and called the historical vocabulary for \((s,r,t_k)\). It contains all object entities appearing in temporal facts with subject entity s and relation r in the known snapshots \(G_{(t_1,t_{k-1})}=\{G_{t_1},G_{t_2},\cdots ,G_{t_{k-1}}\}\) before \(t_k\). The historical vocabulary \({\textbf {H}}_{t_k}^{(s,r)}\) is an N-dimensional multi-hot indicator vector, where N is the cardinality of \(\mathcal {E}\); the values of entities in the historical vocabulary are set to 1 while all others are 0.

Fig. 2. Overview of ST-Net. ST-Net implements fine-grained modeling of entity representations by mining the spatial information of entities, and combines two inference modes to predict the missing entity in a query. In the figure, light purple nodes are the candidate object entities in the historical vocabulary for the query \((s_1,r_2,?,t_q)\).

Prediction of a missing temporal fact aims to infer the quadruple \((s,r,?,t_q)\) or \((?,r,o,t_q)\) from the known snapshots \(G_{(t_1,t_{q-1})}=\{G_1,G_2,\cdots ,G_{t_{q-1}}\}\). Without loss of generality, the task of the model is defined as predicting the missing object entity.

3.2 Model Components

Query Vectorization. In the encoding phase, the information contained in the query needs to be converted into a continuous low-dimensional vector by the embedding layer. ST-Net first randomly initializes an entity feature \({\textbf {s}}_i\) and a relation feature \({\textbf {r}}_i\) for all \(s \in \mathcal {E}\) and all \(r \in \mathcal {R}\). For the model to be time-aware, the timestamps of temporal facts need to be encoded. Defining the embedding of a unit time step as \({\textbf {t}}_u\) with \({\textbf {t}}_1 = {\textbf {t}}_u\), the embedding of timestamp \({\textbf {t}}_k\) is given by:

$$\begin{aligned} {\textbf {t}}_k = {\textbf {t}}_{k-1} + {\textbf {t}}_u \end{aligned}$$
(1)
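To make Eq. (1) concrete: the recurrence simply accumulates the unit-step embedding, so \({\textbf {t}}_k = k \cdot {\textbf {t}}_u\). A minimal PyTorch sketch of this accumulation follows; the dimension value and initialization scale are assumptions, not taken from the paper's code.

```python
import torch

d = 200  # hidden dimension (the paper reports 200 in Sect. 4.1)

# Learnable unit-step embedding t_u; by Eq. (1), t_k = t_{k-1} + t_u, i.e. t_k = k * t_u.
t_u = torch.nn.Parameter(torch.randn(d) * 0.01)

def timestamp_embedding(k: int) -> torch.Tensor:
    """Embedding of the k-th timestamp (1-indexed), accumulated as in Eq. (1)."""
    return k * t_u
```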

ST-Net adds spatial awareness on top of time awareness. Due to the complexity and uncertainty of the data, the spatial information attached to entities is hard for neural networks to learn directly. We found that the text of most entities in TKGs already contains spatial information, such as Citizen (India) and Government (Pakistan). This is because such information has been summarized and organized during temporal fact extraction, but it has not been fully utilized. ST-Net obtains spatial information by preprocessing the entity text and the text in the raw data. After this processing, for the few entities still lacking spatial information, the entity itself can be used directly as its spatial information, or an existing pre-trained language model can generate it via prompt learning. After acquiring the spatial information for all entities, ST-Net randomly initializes their embeddings to obtain the spatial feature \({\textbf {sp}}\).
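As an illustration of this preprocessing step (a simplification; the paper does not spell out its exact extraction rules), the sketch below pulls the parenthesized location out of ICEWS-style entity names and falls back to the entity name itself when no location is present. The vocabulary helper is ours.

```python
import re

def extract_spatial_info(entity_name: str) -> str:
    """Extract the parenthesized spatial tag from an ICEWS-style entity name.

    'Citizen (India)'       -> 'India'
    'Government (Pakistan)' -> 'Pakistan'
    'Hindustan Times'       -> 'Hindustan Times'  (fallback: the entity itself)
    """
    match = re.search(r"\(([^)]+)\)\s*$", entity_name)
    return match.group(1) if match else entity_name

spatial_vocab = {}  # spatial string -> index, built over all entities

def spatial_index(entity_name: str) -> int:
    """Map an entity to the index of its spatial feature embedding."""
    key = extract_spatial_info(entity_name)
    return spatial_vocab.setdefault(key, len(spatial_vocab))
```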

The discovered spatial information enables the model to carry out fine-grained modeling of entity representations, so that each entity has a richer and more informative embedding that improves downstream tasks. After query vectorization, the model needs to capture the historical vocabulary of the query to facilitate the subsequent reasoning process.

Capture Historical Vocabulary. First, the training set is divided into a series of snapshots \(G_1,G_2,\cdots ,G_{t_{train}}\) by timestamp. Then the historical vocabulary is obtained for each subject entity and relation combination (s, r, t) in each snapshot, i.e. \(\{{\textbf {h}}_{t_1}^{(s,r)}, {\textbf {h}}_{t_2}^{(s,r)},\cdots , {\textbf {h}}_{t_{train}}^{(s,r)}\}\). During training, ST-Net is trained on each snapshot in timestamp order while incrementally maintaining the historical vocabulary over all previous snapshots. When evaluating the model on the validation and test sets, the full historical vocabulary accumulated from the whole training set is used.

Specifically, for each query quadruple \((s,r,?,t_k)\) at time \(t_k\), during training ST-Net expands the historical vocabulary specific to \((s,r,t_k)\) from the snapshots before time \(t_k\), as formalized below:

$$\begin{aligned} {\textbf {H}}_{t_k}^{(s,r)} = {\textbf {h}}_{t_1}^{(s,r)} + {\textbf {h}}_{t_2}^{(s,r)} + \cdots + {\textbf {h}}_{t_{k-1}}^{(s,r)} \end{aligned}$$
(2)

where \({\textbf {H}}_{t_k}^{(s,r)}\) is an N-dimensional multi-hot indicator vector in which 1 marks every entity in the current historical vocabulary (see the sketch below). The two modes of reasoning are introduced next.
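A minimal sketch of how the multi-hot historical vocabulary of Eq. (2) can be maintained incrementally. It is NumPy-based; the entity count and data layout are illustrative assumptions, not the authors' code.

```python
from collections import defaultdict
import numpy as np

N = 23033  # cardinality of the entity vocabulary (illustrative value)

# H[(s, r)] accumulates the multi-hot vector over all snapshots seen so far.
H = defaultdict(lambda: np.zeros(N, dtype=np.float32))

def update_history(snapshot):
    """Add one snapshot: a list of (s, r, o) integer triples at the current timestamp."""
    for s, r, o in snapshot:
        H[(s, r)][o] = 1.0

def historical_vocabulary(s, r):
    """H_{t_k}^{(s,r)}: 1 for entities seen as objects of (s, r) in earlier snapshots, else 0."""
    return H[(s, r)]
```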

Copy-Generation Mode. The Copy mode aims to identify repeated facts and predict future facts by referring to known facts of the same type in history. Given a query \((s,r,?,t_k)\) and its corresponding historical vocabulary \({\textbf {H}}_{t_k}^{(s,r)}\), the Copy mode increases the estimated probability of the object entities contained in the historical vocabulary. The Copy mode generates the query vector \({\textbf {v}}_q\) with an MLP:

$$\begin{aligned} {\textbf {v}}_q = tanh({\textbf {W}}_c[{\textbf {s}},{\textbf {sp}},{\textbf {r}},{\textbf {t}}_k] + {\textbf {b}}_c) \end{aligned}$$
(3)

where \({\textbf {W}}_c \in \mathbb {R}^{4d \times N}\) and \({\textbf {b}}_c \in \mathbb {R}^N\) are trainable parameters. The query vector \({\textbf {v}}_q\) is an N-dimensional vector, where N is the cardinality of the whole entity vocabulary \(\mathcal {E}\).

To minimize the probability of entities that have not formed known facts with s and r in history, we first change the values of these uninterested entities in \({\textbf {H}}_{t_k}^{(s,r)}\) to a small negative number, yielding the modified vector \(\dot{{\textbf {H}}_{t_k}^{(s,r)}}\). The Copy mode then adds the query vector to \(\dot{{\textbf {H}}_{t_k}^{(s,r)}}\) to restrict the scope of candidate entities. After applying the softmax function, the probabilities of the uninterested entities are minimized.

$$\begin{aligned} {\textbf {c}}_q = {\textbf {v}}_q + \dot{{\textbf {H}}_{t_k}^{(s,r)}} \end{aligned}$$
(4)
$$\begin{aligned} {\textbf {p}}(c) = softmax({\textbf {c}}_q) \end{aligned}$$
(5)

where \({\textbf {c}}_q\) is an N-dimensional index vector and \({\textbf {p}}(c)\) is an N-dimensional probability distribution representing the prediction probabilities over the historical vocabulary. Finally, the entity with the largest probability in \({\textbf {p}}(c)\) is used to answer the query. The advantage of the Copy mode is that it predicts from a much smaller candidate space than the overall vocabulary. However, entirely new facts can also appear in upcoming snapshots, so a Generation mode is needed to predict such facts.
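Before moving on, a hedged PyTorch sketch putting Eqs. (3)–(5) together; the masking constant and tensor names are our assumptions based on the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CopyMode(nn.Module):
    def __init__(self, d: int, num_entities: int):
        super().__init__()
        self.W_c = nn.Linear(4 * d, num_entities)  # W_c and b_c of Eq. (3)

    def forward(self, s, sp, r, t_k, hist_vocab):
        # Eq. (3): query vector over the whole entity vocabulary.
        v_q = torch.tanh(self.W_c(torch.cat([s, sp, r, t_k], dim=-1)))
        # Entities outside the historical vocabulary receive a small negative
        # offset, so the softmax assigns them near-zero probability.
        changed = torch.where(hist_vocab > 0,
                              torch.zeros_like(v_q),
                              torch.full_like(v_q, -100.0))
        c_q = v_q + changed                  # Eq. (4)
        return F.softmax(c_q, dim=-1)        # Eq. (5): p(c)
```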

Given the same query \((s,r,?,t_k)\) as the Copy mode, the Generation mode is responsible for selecting the object entity from the whole entity vocabulary \(\mathcal {E}\). The Generation mode regards the predicted fact as a completely new fact, without reference to the facts that have occurred in history. It also generates a query vector \({\textbf {g}}_q\), which is normalized with the softmax function for prediction:

$$\begin{aligned} {\textbf {g}}_q = {\textbf {W}}_g [{\textbf {s}},{\textbf {sp}},{\textbf {r}},{\textbf {t}}_k] + {\textbf {b}}_g \end{aligned}$$
(6)
$$\begin{aligned} {\textbf {p}}(g) = softmax({\textbf {g}}_q) \end{aligned}$$
(7)

where \({\textbf {W}}_g \in \mathbb {R}^{4d \times N}\) and \({\textbf {b}}_g \in \mathbb {R}^{N}\) are trainable parameters. Similar to \({\textbf {p}}(c)\) in the Copy mode, \({\textbf {p}}(g)\) represents the predicted probabilities over the entire entity vocabulary. The maximum value in \({\textbf {p}}(g)\) denotes the object entity predicted by the Generation mode over the whole entity vocabulary. The Generation mode is complementary to the Copy mode, with the ability to predict entirely new facts.
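A matching sketch of the Generation mode (Eqs. 6–7), under the same assumptions as the Copy mode sketch above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GenerationMode(nn.Module):
    def __init__(self, d: int, num_entities: int):
        super().__init__()
        self.W_g = nn.Linear(4 * d, num_entities)  # W_g and b_g of Eq. (6)

    def forward(self, s, sp, r, t_k):
        g_q = self.W_g(torch.cat([s, sp, r, t_k], dim=-1))  # Eq. (6)
        return F.softmax(g_q, dim=-1)                        # Eq. (7): p(g)
```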

3.3 Parameter Learning and Inference

Given a query (s, r, ?, t), predicting the object entity can be viewed as a multi-class classification task, where each class corresponds to an object entity. The learning objective is to minimize the following cross-entropy loss \(\mathcal {L}\):

$$\begin{aligned} \mathcal {L} = - \sum _{t \in \mathcal {T}} \sum _{i \in \mathcal {E}} \sum _{k=1}^{K} o_{it} \ln {\textbf {p}}(y_{ik}|s,r,t) \end{aligned}$$
(8)

where \(o_{it}\) is the i-th ground-truth object entity in the snapshot \(G_t\), and \({\textbf {p}}(y_{ik}|s,r,t)\) is the combined probability of the k-th object entity in the snapshot \(G_t\) when the i-th ground-truth object entity is \(o_i\).

In order to ensure that the probabilities of all entities in \(\mathcal {E}\) sum to 1, we set a hyperparameter \(\alpha \) to adjust the weight between the Copy mode and the Generation mode, which is defined as follows:

$$\begin{aligned} {\textbf {p}}(o|s,r,t) = \alpha *{\textbf {p}}(c) + (1 - \alpha ) *{\textbf {p}}(g) \end{aligned}$$
(9)
$$\begin{aligned} o_t = argmax_{o \in \mathcal {E}} {\textbf {p}}(o|s,r,t) \end{aligned}$$
(10)

where \(\alpha \in [0,1]\) and \({\textbf {p}}(o|s,r,t)\) is the final predicted probability vector, which contains the probabilities of all entities under the current query.
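A short sketch of how the two distributions are blended and used for training and inference (Eqs. 8–10); the epsilon inside the log and the default alpha value are assumptions.

```python
import torch
import torch.nn.functional as F

def combine(p_c: torch.Tensor, p_g: torch.Tensor, alpha: float = 0.8) -> torch.Tensor:
    """Eq. (9): blend Copy and Generation distributions; the result still sums to 1."""
    return alpha * p_c + (1.0 - alpha) * p_g

def loss(p: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Eq. (8): cross-entropy on the blended distribution for the ground-truth objects."""
    return F.nll_loss(torch.log(p + 1e-9), target)

def predict(p: torch.Tensor) -> torch.Tensor:
    """Eq. (10): pick the entity with the highest combined probability."""
    return torch.argmax(p, dim=-1)
```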

4 Experiments

In this section, we evaluate our proposed method on the link prediction task on three public TKG datasets. We first introduce experimental settings in detail, including details of datasets and baselines. After that, we compare and analyze the experimental results and conduct an ablation study to evaluate the importance of spatial information.

4.1 Experimental Setup

Table 1. Statistics of the datasets.

Datasets and Evaluation Metrics.

We evaluate ST-Net on three benchmark datasets for link prediction: ICEWS18 [20], ICEWS14 [21] and ICEWS05-15 [21]. Table 1 summarizes the dataset statistics. We split each dataset into training, validation and test sets with 80%/10%/10% proportions in chronological order, and adopt the filtered versions of Mean Reciprocal Rank (MRR) and Hits@1/3/10 to measure the performance of ST-Net.
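For reference, both metrics can be computed from the rank of the ground-truth entity for each test query. The sketch below is the raw (unfiltered) form, whereas the reported numbers use the filtered setting.

```python
import numpy as np

def mrr_and_hits(ranks, ks=(1, 3, 10)):
    """ranks: 1-based rank of the ground-truth entity for each test query."""
    ranks = np.asarray(ranks, dtype=np.float64)
    mrr = float(np.mean(1.0 / ranks))
    hits = {k: float(np.mean(ranks <= k)) for k in ks}
    return mrr, hits
```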

Baselines. We compare ST-Net with multiple static knowledge graph embedding (SKGE) and temporal knowledge graph embedding (TKGE) models. The former includes TransE [11], DistMult [15], ComplEx [14], RotatE [22], and SimplE [23], while the latter includes TTransE [1], HyTE [2], TA-DistMult [21], DE-DistMult [4], DE-SimplE [4], RE-Net [5], CyGNet [6], and ATiSE [24].

Parameter Settings. The hyperparameter \(\alpha \), which balances the Copy mode and the Generation mode, is chosen according to the MRR on the validation set of each dataset. After extensive experiments, we found that the model works best when \(\alpha \) is set to 0.8 on ICEWS14 and ICEWS18 and 0.9 on ICEWS05-15. The model parameters are initialized with Xavier initialization [25] and optimized with the AMSGrad optimizer using a learning rate of 0.001. The batch size is set to 1024 and the hidden embedding dimension to 200. Training is limited to 30 epochs, which is enough for the model to converge in most cases. The baseline results are taken from CyGNet [6] and ATiSE [24].
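The settings above can be summarized in a small configuration sketch, assuming PyTorch (where AMSGrad is exposed as a flag on Adam); the config container itself is ours.

```python
import torch

CONFIG = {
    "alpha": {"ICEWS14": 0.8, "ICEWS18": 0.8, "ICEWS05-15": 0.9},
    "learning_rate": 1e-3,
    "batch_size": 1024,
    "hidden_dim": 200,
    "max_epochs": 30,
}

def make_optimizer(parameters):
    # AMSGrad variant of Adam, as described in the parameter settings.
    return torch.optim.Adam(parameters, lr=CONFIG["learning_rate"], amsgrad=True)
```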

4.2 Results

Table 2. Results on ICEWS18, ICEWS14 and ICEWS05-15. The best results are in bold, and the second best ones are underlined.

Table 2 reports the link prediction results of ST-Net and the baseline methods on the three TKG datasets. We observe that all SKGE methods perform worse than most TKGE methods, because they can neither capture the temporal information of facts nor model the dynamic interactions of entities and relations. Nevertheless, they generally outperform TTransE and HyTE. We believe the reason is that TTransE and HyTE learn the representation of each snapshot at its own timestamp separately, without linking entity representations across timestamps, and therefore lack the ability to capture temporal sequence information.

Table 2 also shows that ST-Net significantly outperforms the other baselines on ICEWS18 and ICEWS05-15. For further analysis, we calculated that the proportion of repeated events in ICEWS18 is 49.24%, which improves the prediction of the Copy mode to a certain extent. Moreover, in ICEWS18, as many as 41.3% of the facts have subject and object entities located in the same space, which also explains why the reasonable use of spatial information can improve the prediction results. We argue that the spatial information shared between the entities of a fact can also help the model learn the semantic information implied by the relation, i.e., which spatial locations a relation tends to occur between.

The experimental results indicate that ST-Net uses the spatial information attached to entities to implement fine-grained modeling of entity representations, which effectively enriches entity information and improves link prediction performance.

4.3 Ablation Study

Table 3. Results (in percentage) of the ablation study on ICEWS18.

To explore how different ways of acquiring spatial information affect the model, we conducted an ablation study. We create variants of ST-Net by adjusting how spatial information is obtained and compare their performance with ST-Net on the ICEWS18 dataset. From the results in Table 3, we can observe the importance of spatial information. After removing the spatial information, all metrics of the model decrease, which demonstrates that the spatial information contained in entities plays a vital role in predicting future facts. In addition, ST-Net-Prompt (bert-base-cased) completes the missing spatial information of entities through the Fill-Mask task of the pre-trained language model bert-base-cased, where the template is set to: [Entity text]'s country is [MASK]. Its performance lies between ST-Net and ST-Net-No-Spatial-Information, which further illustrates the importance of spatial information and also reflects the limitations of completing spatial information with a pre-trained model.
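The prompt-based variant can be sketched with the Hugging Face fill-mask pipeline; the template follows the one stated above, while the top-1 choice and the model handle are assumptions.

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-cased")

def complete_spatial_info(entity_text: str) -> str:
    """Fill "[Entity text]'s country is [MASK]." and take the highest-scoring token."""
    prompt = f"{entity_text}'s country is [MASK]."
    return fill_mask(prompt)[0]["token_str"].strip()
```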

5 Conclusion

In this paper, we explore the spatial information attached to entities and predict future facts by combining the Copy and Generation reasoning modes. The experimental results show that ST-Net achieves promising performance in predicting future facts in temporal knowledge graphs. In future work, we plan to mine more accurate and effective information from entities through prompt learning and combine it to help the model perform better. Meanwhile, further study of how to utilize spatial information is also worthwhile.