1 Introduction

A knowledge graph (KG) is a graph-structured data model that integrates data and stores interlinked descriptions of entities. KGs efficiently encode knowledge and support reasoning about complex topics; they help users grasp and model complex concepts and make better business decisions based on factual data [15]. In KGs, the Resource Description Framework (RDF) is primarily used to describe linked data resources on the web. RDF uses a subject-predicate-object triplet format (\(s-r-o\)), where the subject (s) and object (o) are entities such as persons, places, locations, and events, and the predicate (r) is a relation type between the entity pair, such as is_a, live_in, and located_in. These triplets are linked to create a graph of interconnected entities. Consider the triplet “ElonMusk-wasBornIn-Pretoria,” in which wasBornIn is the relation type between Elon Musk and Pretoria. Large-scale open KGs such as YAGO, WikiData, DBpedia, and FreeBase are publicly available on the web [38]. The key to the success of KGs is their flexibility in merging and connecting heterogeneous data regardless of their disparity. However, entities and relations may have heterogeneous descriptions, leading to incomplete triplets. Moreover, KGs can be incomplete due to user-generated organic entities that include meaningless representations, invalid or incomplete attributes, stale content, or entity pairs with no mapped relation. Manually enumerating the relation types among pairs of entities is complex and time consuming [4]. Since KGs are incomplete by design, they can lead to unreliable query results in applications such as question answering [22], cyber defense [14], and recommendation systems [8].

Fig. 1 KG example from the YAGO3-10 dataset

In recent years, KG representation learning (KGRL) has focused on mapping the latent graph structure to a numerical feature vector space, also known as embedding. The learned embeddings are used to estimate missing links by computing scores for all possible triplets. Recent progress in KGRL can be grouped into three categories according to the embedding approach. The first category comprises translation distance models [2] and their variants, which embed the triplet (\(s-r-o\)) in a real-valued vector space by projecting it into different spaces under different constraints. The second category comprises tensor factorization models [27] and their variants, which capture the latent semantics of a KG in a partially observed tensor of size \(|E|\times |E|\times |R|\), where E is the entity set and R is the relation set. The third category comprises neural network models that use the multilayer perceptron and convolutional neural networks (CNN); for example, ConvE [5] is a CNN model that encodes the entities and relations with an activation function to score the triplet [15]. The learning process of existing KG embedding models requires an artificially generated negative triplet for each true positive triplet to compute the training loss. However, negative sampling techniques such as uniform [2], Bernoulli [34], and KBGAN [3] suffer from vanishing gradient problems that affect model performance. Therefore, embedding models without negative sampling requirements are needed.

The learned embeddings are utilized to predict the missing entity for a given relation-entity pair (\(?-r-o\) or \(s-r-?\)). The entity prediction approach uses the score function to compute the likelihood of all possible triplets created by substituting thousands of entities from the entity set E, which is a computationally intensive process. However, a KG involves far fewer relation types than entities. Consequently, relation prediction makes missing fact inference computationally more efficient than entity prediction. This research aims to efficiently predict the missing relation \((s-?-o)\) between a given entity pair \((s, o)\). Consider the subset of the YAGO3-10 dataset shown in Fig. 1. The knowledge graph contains triplets such as “ElonMusk-wasBornIn-Pretoria,” “ElonMusk-graduatedFrom-StanfordUniversity,” “LosAngeles-isLocatedIn-California,” “ElonMusk-hasGender-Male,” and “StanfordUniversity-isLocatedIn-California.” The relation type between Elon Musk and Los Angeles is currently missing. Inferring the missing relation livesIn for the given entity pair “ElonMusk-?-LosAngeles” and adding it as a true fact constitutes knowledge graph completion.

The variance of the entity degree in a KG is higher than its mean, which means that relation types are unevenly distributed over entity pairs. Empirically, most relation types are concentrated on a significant number of entity pairs, while some entities have far more relations with other entities than others do. Therefore, influential information is present in the neighborhood of an entity. The graph attention network [32] uses neighborhood information to obtain embeddings. Recent attention-based KG embedding models such as KGAE [25], GGAE [17], and DMACM [9] derive embeddings using node attention mechanisms that iteratively transmit differently assigned attentions and integrate them through different aggregation functions. These models compute several independent attentions through various transformation operations. Consequently, attention mechanisms can be computationally expensive when applied directly to the entire KG, especially for large graphs. Therefore, a KG embedding model with a simplified attention technique is needed to reduce complexity and enhance missing fact inference.

Based on the preceding analysis, we propose a novel Triplet Subgraph Attention Embedding (TSAE) model for relation prediction to tackle the fact inference problem. TSAE uses an adjacency matrix to capture the subgraph structure formed by the input triplet entity pair and its neighboring entities. TSAE maps the subgraph entities from a lower- to a higher-dimensional space using a transformation matrix to expand the expressiveness of the features. The attention mechanism computes an attention value for each neighboring entity of a triplet entity and subsequently updates the entity features using an aggregation function over its own and its neighbors’ attention-aware features. A dense feed-forward neural network processes the attention-aware triplet entity features through hidden layers to compute the relative likelihood of all relation types between the entity pair. TSAE thus utilizes adjacent entity information (the subgraph structure) to learn entity features: the attention mechanism revises the features of a central entity (entity embedding) using the features of all one-hop adjacent entities to exploit substructure proximity. Based on the above discussion, TSAE has several advantages: (1) it limits the scope of attention to the relevant subgraph surrounding the triplet, reducing computational complexity while still capturing essential information; (2) it generates a more fine-grained entity embedding using a simplified attention mechanism that better captures the nuances of neighboring entities, leading to improved performance in inference tasks; and (3) it offers more interpretable embeddings because it explicitly considers the context of each triplet. Therefore, TSAE is helpful for applications where efficiency, simplicity, and interpretability are essential. Key contributions are listed below:

  1. We propose a triplet subgraph attention embedding model for relation prediction, based on an attention mechanism that updates the central entity features by attending to all one-hop neighbors.

  2. We design a feed-forward neural network to estimate the probability (score) of all relation types between a given entity pair and rank all triplets according to the respective relation-type score.

  3. Experimental results on three benchmark data sets show the superior performance of TSAE over state-of-the-art baselines. Extensive experiments analyzing model performance with respect to training parameters and a case study demonstrate its effectiveness for the KG completion task.

2 Related work

In recent decades, translation distance models have made significant progress. The classical TransE [2] proposed the simplest embedding model, which embeds the relation features r as a translation from the subject to the object entity features, using s \(+\) r \(\approx \) o as the triplet score. However, translation distance creates issues for complex relation categories [35]. TransH [34] provides a flexible solution by projecting the entity vectors onto a relation-specific hyperplane. TransR [21] generalizes the approach of TransH by projecting entity vectors into different relation-specific spaces, improving model performance but increasing complexity. TransD [10] further separates the projections for the subject and object entities to obtain projections with the same settings as TransH. TorusE [6] extends the TransE principle by projecting the entities and relations onto a Lie group called a torus. Hyun-Je Song enhances the translational model to learn embeddings for new triplet entities using existing learned embeddings in an incremental training process that minimizes the risk of parameter change [30]. DTransE [29] is the first distributed training approach for TransE, using a triplet partitioning algorithm to balance the resource workload by considering higher- and lower-degree entities. TransX-pa [39] deals with circular and hierarchical entity structures using position-aware and self-attention block embeddings to generalize the mechanisms of previous translational models.

Another stream of embedding is tensor factorization, first proposed in RESCAL [27] to capture the latent semantics in a partially observed tensor X of size \(|E|\times |E|\times |R|\). This tensor represents the presence or absence of a relation type between the entity pairs of the triplets. The underlying structure of the KG is learned by rank-d factorization, where each slice of X is factorized as s\(\textbf{R}_k\) o\(^\intercal \) to extract latent semantics. DistMult [36] simplifies RESCAL by using only the diagonal of \(\textbf{R}_k\) in the scoring function; the embedding captures the pairwise interaction among entities over the same dimension to reduce complexity. ComplEx [31] extends DistMult using a complex-valued vector space to improve entity prediction performance. DURA [33] tackles unrelated embeddings for entities with related semantics using a duality-induced regularizer to ensure similar embeddings for semantically similar entities.

Recent models have adopted the convolutional neural network. The first successful model, ConvE [5], combines the triplet subject entity and relation vectors in a 2D convolution layer that yields a feature map tensor, which is then combined with the object vector through a dot product to compute a score for the given triplet. ConvKB [26] concatenates the triplet entity and relation vectors for a 2D convolution layer to increase learning efficiency. ConvR [12] captures relation-specific features from the given triplet entity vectors using relation-specific filters in the 2D convolution. Graph2Seq [18] captures the subgraph structure using an information fusion approach to learn the entity embedding. The convolution layer of M-DCN [40] aggregates entity and relation features and then extracts the interactions between them using dynamically relevant filters.

Table 1 Statistics of data set

In addition, some recent models have adopted an attention mechanism for KG embedding. KGAE [25] introduces a generalized node attention mechanism for link prediction by iteratively propagating differently assigned attentions. The HARN [20] framework propagates attention over the relation-path neighbors of entities, aggregated by different aggregator functions. GGAE [17] combines one-hop and multi-hop neighbor entity features to compute entity attention and relation attention for learning the entity and relation embeddings. LSA-GAT [11] considers cycle- and star-type spatial component structures using neighborhood aggregation combined with a CNN-based decoder to calculate triplet scores. The authors of [16] propose a graph attention-based model that treats a KG as an irregular graph and integrates graph-structured information through multiple independent channels. AR-KGAN [41] jointly models triplets and logical rules in a unified framework consisting of an encoder with a neighborhood aggregator that effectively combines association-rule-based and graph-based attention weights to capture entity relationships. JSSKGE [19] uses a graph attention mechanism to learn the subgraph structure and combines soft logical rules to rectify the learned embeddings. T-GAE [7] utilizes attention mechanisms to capture the importance of different entities within a temporal knowledge graph and introduces timespan-aware embeddings to model the evolving nature of entities and relations using an LSTM network. The aforementioned methods deal with the problem of KG incompleteness. However, KGs can also contain corrupted triplets due to acquisition noise and deliberate attacks; dealing with corrupted KGs involves techniques such as anti-attack measures, data validation, regular audits, and anomaly detection [23].

All the models reviewed above complete the KG through the entity prediction task. In general, however, fewer relation types connect many entities, as shown by the KG statistics in Table 1. Moreover, the relation types are not uniformly distributed across the entities, which affects fact inference. The attention-based models compute several independent attentions through various transformation operations, which can be computationally expensive when applied directly to a large KG. Furthermore, these models use artificially generated false (negative) samples to compute the training loss between true and false triplets, and negative sampling is likely to cause the vanishing gradient problem during training [13]. Therefore, we present a simplified graph attention technique that generates fine-grained embeddings without a negative sampling requirement for missing fact inference in the knowledge graph.

3 Proposed methodology

3.1 Preliminary

A knowledge graph is a multi-relational structured network of real-world entities with relationships between pairs of entities connected by directed edges. We define \(KG=\{E,R,T\}\), where E is a set of entities, R is a set of relations, and T denotes a set of observed fact triplets in the KG. Each triplet (\(e_i,r_k,e_j\)) consists of a relation \(r_k\in R\) between the entity pair \(e_i,e_j \in E\). The entity feature vectors are \(\varvec{e}_i\), \(\varvec{e}_j\in \mathbb {R}^d\), where d is the embedding dimension. Equation (1) creates the adjacency matrix A of size \(|E|\times |E|\) that represents the graph structure. A subgraph consists of a central entity \(e_i\) and its surrounding neighbors. Let \(\mathcal {N}_i\) denote the 1-hop directed-edge neighbors of \(e_i\), including \(e_i\) itself, as follows:

$$\begin{aligned} \varvec{A}_{ij} = {\left\{ \begin{array}{ll} 1 & \text {if } (e_i,r_k,e_j) \in T \\ 0 & \text {otherwise} \end{array}\right. } \end{aligned}$$
(1)
$$\begin{aligned} \mathcal {N}_i = \{e_i\} \cup \{e_j \in E \mid A_{ij} = 1\} \end{aligned}$$
(2)
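To make Eqs. (1) and (2) concrete, the following sketch builds the adjacency matrix and a 1-hop neighborhood from a list of integer-indexed triplets. The helper names and the toy triplet list are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of Eq. (1) and Eq. (2); entity and relation indices are assumed integers.
import numpy as np

def build_adjacency(triplets, num_entities):
    """A[i, j] = 1 iff some triplet (e_i, r_k, e_j) is observed in T (Eq. 1)."""
    A = np.zeros((num_entities, num_entities), dtype=np.int8)
    for s, _, o in triplets:              # each triplet is (subject_id, relation_id, object_id)
        A[s, o] = 1
    return A

def neighborhood(A, i):
    """1-hop neighbors of entity e_i, i.e. all j with A[i, j] = 1, plus e_i itself (Eq. 2)."""
    return {i} | {int(j) for j in np.flatnonzero(A[i])}

# toy KG with 5 entities and 2 relation types
T = [(0, 0, 1), (0, 1, 2), (3, 0, 2), (0, 0, 4)]
A = build_adjacency(T, num_entities=5)
print(neighborhood(A, 0))   # {0, 1, 2, 4}
```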

3.2 Problem formulation

In a supervised learning setup, the fact inference task is formulated as an optimization problem that determines the missing relation between a given triplet entity pair by maximizing the likelihood of the true relation type using the updated attention-aware triplet entity features. Therefore, for a given pair of disjoint entities (\(e_1,e_2\)), we model the probability distribution \(p(\forall r_k \in R\mid (e_1,e_2))\) such that the true relation type \(r_{pred}\) gains the highest probability, as expressed in the following equation:

$$\begin{aligned} r_{pred} = \mathop {\arg \max }\limits _{r_k \in R} \, p(r_k \mid (e_1,e_2)) \end{aligned}$$
(3)
Fig. 2 The design framework of the proposed TSAE model consists of a triplet subgraph to obtain neighborhood entities, an attention layer to generate the embedding, and a dense layer to compute the loss

3.3 TSAE architecture

The TSAE framework consists of three parts. The first part takes the input triplet and generates the triplet subgraph set comprising the triplet entities and their neighboring entities. The attention layer processes the subgraph entity features to deliver attention-aware triplet entity features. The dense layer concatenates the attention-aware triplet feature vectors and processes them through hidden layers; its output layer consists of neurons equal to the number of relation types. Finally, the categorical cross-entropy loss is computed between the output layer result vector and the one-hot encoding vector of the respective triplet’s true relation type. A working illustration of the TSAE model for the input triplet (California (\(e_1\)), isLocatedIn (r), UnitedState (\(e_2\))) is shown in Fig. 2. The triplet subgraph creates a set \(E_{sub}=\{e_1,e_2,e_3,e_4,e_5\}\) consisting of the triplet entity pair and their neighborhoods. The attention layer maps each entity feature vector \(\varvec{e}_i\in \mathbb {R}^d \) to \(\varvec{e}_i^* \in \mathbb {R}^{d^\prime }\) using \(W_T \in \mathbb {R}^{d\times d'}\) and computes the attention value \(e_{ij}\) for \((e_1, e_2)\) and their neighbors. Subsequently, it updates the triplet entity features with the aggregation function to obtain the attention-aware features \(\varvec{e}^\prime _1\) and \(\varvec{e}^\prime _2\) using the normalized attention values \(\alpha _{ij}\). The dense layer then takes the updated entity feature vector \(\varvec{e}'_1 || \varvec{e}'_2\) as input and feeds it through the hidden layers to obtain the relation probabilities at the output layer. Finally, we calculate the training loss between the output layer vector \(\hat{\varvec{r}}\) and the one-hot encoding vector r.

Algorithm 1 TSAE training

Based on the above discussion, Algorithm 1 presents the training process of the TSAE model. We first pass the triplet set T, entity set E, relation set R, and the adjacency matrix \(\varvec{A}\) to the algorithm, together with the embedding and transformation dimensions d and \(d^\prime \), respectively. The Xavier uniform approach is used to initialize a vector of dimension d for each entity \(e_i \in \mathbb {R}^d\) as follows:

$$\begin{aligned} \varvec{e}_i=uniform(-\frac{6}{\sqrt{d}},\frac{6}{\sqrt{d}} )\mid \forall e_i\in E \end{aligned}$$
(4)

The uniform approach samples each scalar parameter \(v_i\in \varvec{e}_i=[v_0,v_1,\dots ,v_d]\) independently from the uniform distribution over this interval. Thus, each feature value \(v_i\) has equal weight in defining the entity. We initialize the feature vector of each entity \(e \in E \) using the Xavier uniform method, each relation vector \(r\in R\) using one-hot encoding, and the transformation matrix \(\varvec{W}_T \in \mathbb {R}^{d\times d'}\) using the random uniform method, as shown in lines 1, 2, and 3 of Algorithm 1, respectively.
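A minimal sketch of this initialization step (Algorithm 1, lines 1-3), following Eq. (4) literally; the function names, toy sizes, and use of PyTorch are assumptions for illustration (a common library alternative is torch.nn.init.xavier_uniform_).

```python
# Illustrative initialization of entity vectors, relation one-hot vectors, and W_T.
import torch

def init_entity_embeddings(num_entities, d):
    """Sample every entity vector e_i uniformly from (-6/sqrt(d), 6/sqrt(d)) as in Eq. (4)."""
    bound = 6.0 / d ** 0.5
    return torch.empty(num_entities, d).uniform_(-bound, bound)

def init_relation_one_hot(num_relations):
    """One-hot target vector r for every relation type."""
    return torch.eye(num_relations)

d, d_prime = 250, 500
E = init_entity_embeddings(num_entities=5, d=d)   # |E| x d entity features
R = init_relation_one_hot(num_relations=2)        # |R| x |R| one-hot relation vectors
W_T = torch.rand(d, d_prime)                      # random-uniform transformation matrix
```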

3.4 Triplet subgraph

The first part of the model takes the input triplet \((e_1,r,e_2)\in T\) together with the adjacency matrix \(\varvec{A}\) and generates a triplet subgraph set \(E_{sub}=(\mathcal {N}_1 \cup \mathcal {N}_2)\) consisting of the triplet entities and their neighboring entities. We store the triplet entity pair \(\{e_1, e_2\}\) in \(T_{pair}\) for later use, as shown in lines 5 to 9 of Algorithm 1.
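Continuing the running example, a hedged sketch of this triplet-subgraph step (Algorithm 1, lines 5-9), reusing the illustrative neighborhood helper and adjacency matrix A from the earlier sketch.

```python
# Sketch only: keep the entity pair and take E_sub = N_1 ∪ N_2.
def triplet_subgraph(A, e1, e2):
    T_pair = (e1, e2)                                   # stored for the dense layer later
    E_sub = neighborhood(A, e1) | neighborhood(A, e2)   # triplet entities plus 1-hop neighbors
    return T_pair, E_sub

T_pair, E_sub = triplet_subgraph(A, e1=0, e2=2)
```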

3.5 Entity attention mechanism

All neighboring entities contribute equally to the information sharing process of the Graph Convolution Network (GCN) [37], which treats all neighbors with equal prominence. In response to this shortcoming, the Graph Attention Network (GAT) allocates a different level of prominence to each neighboring entity. As shown in Fig. 2, we incorporate GAT with an updated approach to derive entity features by exploiting the different prominence levels of each entity’s neighborhood through the attention mechanism. First, we map the features of each entity \(e_i \in E_{sub}\) from the lower dimension d to the higher dimension \(d'\) feature space using a learnable transformation matrix \(W_T \in \mathbb {R}^{d \times d'}\) to achieve adequate expressive power, as follows:

$$\begin{aligned} e_i^*=LeakyReLU(W_T*e_i) \end{aligned}$$
(5)

LeakyReLU restricts extreme negative values in the mapping. Lines 10 to 12 of Algorithm 1 illustrate the mapping procedure for each entity in the triplet subgraph set according to (5). The attention value represents the comparative importance of each neighboring entity’s features to the central target entity, including self-attention. Equation (6) expresses a single GAT layer that computes the attention value \(e_{ij}\) between each triplet entity and its neighborhood set \(\mathcal {N}_i\) or \(\mathcal {N}_j\). Each \(e_{ij}\) gives the importance of the edge from the neighbor entity \(e_j \in \mathcal {N}_i\) to the central entity \(e_i\).

$$\begin{aligned} e_{ij} = \varvec{w}_a (e_i^* \parallel e_j^* )\mid \forall e_j^* \in \mathcal {N}_i \end{aligned}$$
(6)

where \(\varvec{w}_a \in \mathbb {R}^{2d'}\) is a learnable attention weight vector and \(\parallel \) denotes the concatenation operation; line 15 of Algorithm 1 illustrates this step according to (6). Equation (7) normalizes the attention values to obtain a standard scaling across all neighbors, since the central entity can have a varying number of neighboring entities in its subgraph structure. Finally, all normalized attention values are aggregated with the respective \(e_j^*\) to update the features of the triplet entity as follows:

$$\begin{aligned} \alpha _{ij} = softmax(e_{ij})=\frac{\exp (e_{ij})}{\sum _{e_{ik}\in \mathcal {N}_i}\exp (e_{ik})} \end{aligned}$$
(7)
$$\begin{aligned} e'_i = \sigma \Big (\sum _{j\in \mathcal {N}_i}\alpha _{ij}e^*_j\Big ) \end{aligned}$$
(8)

Line 16 of Algorithm 1 shows the summation of attention values, and lines 18 to 21 illustrate the normalization and summation of the normalized attention values according to (7). Accordingly, line 22 shows the aggregation function according to (8).
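The sketch below strings Eqs. (5)-(8) together for one central entity. It is a simplified single-head, GAT-style update under the paper's notation, reusing the illustrative E, W_T, A, and neighborhood objects from the earlier sketches; σ is taken to be the sigmoid here, since the paper does not fix the nonlinearity in Eq. (8).

```python
# Hedged sketch of the attention layer: transformation, attention, softmax, aggregation.
import torch
import torch.nn.functional as F

def attention_update(E, W_T, w_a, neighbors, i):
    """Return the attention-aware feature e'_i of central entity i (Eq. 8)."""
    # Eq. (5): map every entity from d to d' dimensions with the shared transformation
    E_star = F.leaky_relu(E @ W_T, negative_slope=0.2)
    nbrs = sorted(int(j) for j in neighbors)                 # N_i, includes i itself
    # Eq. (6): attention value e_ij = w_a . (e_i* || e_j*) for every j in N_i
    pairs = torch.stack([torch.cat([E_star[i], E_star[j]]) for j in nbrs])
    e_ij = pairs @ w_a
    # Eq. (7): softmax-normalize over the neighborhood
    alpha = torch.softmax(e_ij, dim=0)
    # Eq. (8): attention-weighted aggregation; sigma taken as sigmoid in this sketch
    return torch.sigmoid((alpha.unsqueeze(1) * E_star[nbrs]).sum(dim=0))

w_a = torch.rand(2 * 500)                                    # learnable attention vector, size 2d'
e1_prime = attention_update(E, W_T, w_a, neighborhood(A, 0), i=0)
e2_prime = attention_update(E, W_T, w_a, neighborhood(A, 2), i=2)
```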

3.6 Dense neural network layer

The attention layer supplies the attention-aware triplet entity vectors (\(e'_1,e'_2\)) to the next layer. As shown in the last part of Fig. 2 and illustrated in line 24 of the TSAE algorithm, the dense layer (a feed-forward neural network) concatenates the triplet entity feature vectors and processes them through the hidden layers to compute the predicted relation score distribution \(\varvec{\hat{r}}\). To design a generalized dense layer, we fix the number (size) of neurons in each hidden layer as follows:

$$\begin{aligned} size(H_{i+1}) = \lceil size(H_i) - ((2d'-|R|)/(|H|+1))\rceil \end{aligned}$$
(9)

where |H| represents the total number of hidden layers and \(H_i\) signifies a specific hidden layer. The number of hidden layers itself is left to the model designer. The output layer consists of neurons equal to the number of relation types in the KG, i.e., \(size(H_{out})=|R|\). The categorical cross-entropy loss between the target relation one-hot encoding vector \(\varvec{r}\) (the true class distribution) and the output vector \(H_{out}\) (\(\varvec{\hat{r}}\), the predicted relation type score distribution) is calculated as follows:

$$\begin{aligned} \mathcal {L}(\Omega ) = - \sum _{i=1}^{|R|}\varvec{r}_i \cdot \log (\varvec{\hat{r}}_i) \end{aligned}$$
(10)

Line 25 of Algorithm 1 illustrates the loss function according to (10), and line 26 shows the parameter update step according to the gradient of the loss.
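The following sketch mirrors the dense layer and Eqs. (9)-(10): hidden sizes shrink by a fixed step from 2d' toward |R|, and the loss is the categorical cross-entropy against the one-hot relation vector. The layer choices (linear layers, LeakyReLU hidden activations, sigmoid output) follow the implementation details in Sect. 4.4, but the helper names and the toy input are assumptions.

```python
# Illustrative dense layer built from Eq. (9) plus the loss of Eq. (10).
import math
import torch
import torch.nn as nn

def hidden_sizes(d_prime, num_relations, num_hidden):
    """Eq. (9): shrink each hidden layer by (2d' - |R|) / (|H| + 1), starting from 2d'."""
    step = (2 * d_prime - num_relations) / (num_hidden + 1)
    sizes, cur = [], 2 * d_prime
    for _ in range(num_hidden):
        cur = math.ceil(cur - step)
        sizes.append(cur)
    return sizes

def build_dense(d_prime, num_relations, num_hidden=2):
    dims = [2 * d_prime] + hidden_sizes(d_prime, num_relations, num_hidden) + [num_relations]
    layers = []
    for k in range(len(dims) - 1):
        layers.append(nn.Linear(dims[k], dims[k + 1]))
        layers.append(nn.LeakyReLU(0.2) if k < len(dims) - 2 else nn.Sigmoid())
    return nn.Sequential(*layers)

dense = build_dense(d_prime=500, num_relations=11)            # e.g. a KG with |R| = 11
r_hat = dense(torch.cat([e1_prime, e2_prime]).unsqueeze(0))   # predicted relation scores
r_true = torch.zeros(11)
r_true[3] = 1.0                                               # one-hot vector of the true relation
loss = -(r_true * torch.log(r_hat + 1e-9)).sum()              # Eq. (10)
```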

Fig. 3 Functional structure of fact inference using TSAE

3.7 Fact inference

Figure 3 illustrates the functional structure of the proposed fact inference model as relation prediction between the given disjoint entity pair “ElonMusk-?-California.” The entities Elon Musk and California are given as the subject (\(e_1\)) and object (\(e_2\)) of a missing triplet. The triplet subgraph creates a set \(E_{sub}=(\mathcal {N}_1 \cup \mathcal {N}_2)\) consisting of the triplet entity pair neighborhoods obtained from the sample adjacency matrix \(\varvec{A}\) for entities \(e_1\) and \(e_2\). The attention layer delivers the updated triplet entity pair feature vectors \(e'_1\) and \(e'_2\) using (8). The dense layer computes the probability distribution over all relation types. TSAE infers the true relation type \(r_{pred}\) as the highest-scored (highest-probability) relation for the given disjoint entity pair (\(e_1,e_2\)) using Algorithm 2. We provide the test set \(\mathcal {T}\), the attention-aware entity set \(\varvec{e}'\), and the dense layer weight matrices (\(\varvec{W}_1,\varvec{W}_2,\varvec{W}_3\)) as input to the algorithm. The algorithm predicts the relation and its rank for each test triplet to measure the performance of the model. Initially, the predicted relation set, true relation set, and their rank set are initialized as empty sets, as shown in line 1 of Algorithm 2. Line 3 obtains the triplet entity pair (\(e_1, e_2\), treated as a disjoint entity pair), and line 4 stores its true relation (\(r_k\)) in the TR set. The learned entity pair feature vectors (\(\varvec{e}'_1,\varvec{e}'_2\)) are obtained in lines 5 and 6. The dense layer processes the concatenated entity feature vectors (\(\varvec{e}'_1 || \varvec{e}'_2\)) using the respective weight matrices in line 7, which computes the probability (score) of all relation types in the vector \(\varvec{\hat{r}} \in \mathbb {R}^{|R|}\). Lines 8 and 9 store and sort the score of each relation type in the intermediate relation score list RL (the relation-wise score of each \(r\in R\)). Subsequently, line 10 collects the highest-scored relation in the predicted relation set PR, and line 11 collects its rank (the index of the relation in RL). Lines 3 to 11 are iterated for each triplet in the test set.
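The core loop of Algorithm 2 can be summarised as below: score all relation types for each test entity pair, sort the scores, and record the 1-based rank of the true relation. Here `dense` is the illustrative network from the previous sketch and `entity_emb` is assumed to hold the attention-aware vectors e'; this is a sketch of the procedure, not the authors' implementation.

```python
# Hedged sketch of relation prediction and ranking over a test set.
import torch

def predict_relations(test_triplets, entity_emb, dense):
    predicted, ranks = [], []
    for e1, r_true, e2 in test_triplets:
        x = torch.cat([entity_emb[e1], entity_emb[e2]]).unsqueeze(0)
        scores = dense(x).squeeze(0)                          # \hat{r}: one score per relation type
        order = torch.argsort(scores, descending=True)        # relation indices sorted by score
        predicted.append(order[0].item())                      # top-ranked (predicted) relation
        ranks.append((order == r_true).nonzero().item() + 1)   # 1-based rank of the true relation
    return predicted, ranks
```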

Algorithm 2 Relation prediction

4 Experimental setup

4.1 Data set

This study considers three benchmark data sets well known for KG completion tasks. Table 1 describes the statistical information about the data sets and the data split. WN18RR is a subset of WN18 [2] derived from WordNet, the standard lexical knowledge graph for the English language. Likewise, FB15k-237 is a subset of FB15k [2] derived from Freebase, an online collection of multi-relational structured data about real-world facts. The WN18 and FB15k datasets suffer from test leakage due to inverse relationships between entity pairs; the WN18RR and FB15k-237 benchmarks were therefore released with the reversible relationships removed [5]. The YAGO3-10 dataset, derived from YAGO3 [24], contains millions of triplets about people, organizations, and cities, with 37 different relation types.

Table 2 Relation prediction results in MRR and Hit@N on WN18RR, FB15k-237, and YAGO3-10 data set

4.2 Evaluation metrics

We evaluate the proposed model on the relation prediction task by ranking the scores (likelihoods) of all relation types \(r\in R\) for the given triplet entity pair (\(e_1,e_2\in E\)) from the test set \(\mathcal {T}\). We use the standard ranking metrics Mean Reciprocal Rank (MRR) and Hit@N:

$$\begin{aligned} MRR = \frac{1}{\mathcal {|T|}}\sum _{(e_1, r_k, e_2)\in \mathcal {T}} \frac{1}{RR_{r_k}}\end{aligned}$$
(11)
$$\begin{aligned} Hit@N = \frac{1}{\mathcal {|T|}}\sum _{(e_1, r_k, e_2)\in \mathcal {T}} \mathbb {1}(RR_{r_k} \le N) \end{aligned}$$
(12)

where \(\mathcal {|T|}\) indicates the number of triplets in the test set and \(RR_{r_k}\) is the rank of the true relation for the given triplet entity pair (\(e_1,e_2\)). MRR is the average reciprocal rank, and Hit@N is the fraction of true relation types \(r_k\) ranked within the cut-off N. We use cut-offs of 1 and 5 on the validation and test sets. MRR and Hit@N lie in the range 0 to 1; higher values indicate superior performance.
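Given the 1-based ranks produced by the prediction sketch above, Eqs. (11)-(12) reduce to a few lines; the toy rank list here is purely illustrative.

```python
# Straightforward computation of MRR (Eq. 11) and Hit@N (Eq. 12) from true-relation ranks.
def mrr(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)

def hit_at_n(ranks, n):
    return sum(1 for r in ranks if r <= n) / len(ranks)

ranks = [1, 3, 1, 7, 2]
print(mrr(ranks), hit_at_n(ranks, 1), hit_at_n(ranks, 5))   # ~0.595, 0.4, 0.8
```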

4.3 Baselines

TransE [2] learns entity features using the translation distance, with the subject plus relation features approximating the object features. TransH [34] addresses the polysemy problem of TransE by projecting entities onto relation-specific hyperplanes. DistMult [36] reduces complexity by taking the diagonal of the semantic tensor factor in a scoring function that captures pairwise interactions among entities. ComplEx [31] extends DistMult using complex-valued embeddings for entities and relations. ConvE [5] provides a framework to learn entity features using 2D convolution over the subject and relation features, then adds the object entity features to score the triplet. ConvKB [26] learns entity features by concatenating the entity pair and relation features for 2D convolution, which relates it to TransE. DMACM [9] incorporates a directional self-attention network combined with a convolutional layer as an encoder to derive entity features. GGAE [17] combines one-hop and multi-hop neighbor entity features to compute entity attention.

4.4 Implementation details

For a fair evaluation, we initialized feature vectors of size 250 using Xavier’s method, set the number of training epochs to 1000 with a batch size of 512, and optimized the training process using stochastic gradient descent with a learning rate of 0.0001 for all baselines and TSAE. In TSAE, the transformation matrix \(\varvec{W}_T\) size is set to \(250\times 500\), LeakyReLU uses a negative slope of 0.2, the dropout rate is 0.5, and the kernel regularizer and attention kernel regularizer are set to the \(\textit{L}_1\) norm. The feed-forward neural network consists of a fixed input layer of \(size(H_{inp}) = 1000\) and an output layer of \(size(H_{out}) = |R|\), with two hidden layers of \(size(H_i) \in [673, 346]\) set according to (9) for WN18RR. The network consists of three hidden layers of \(size(H_i) \in [810, 619, 428]\) and \(size(H_i) \in [760, 519, 278]\) for FB15k-237 and YAGO3-10, respectively. LeakyReLU with a negative slope of 0.2 activates each hidden layer neuron, dropout is set to 0.3, and a sigmoid is applied to the output layer. Pairwise margin loss with \(L_2\) regularization and a margin of 5 is used to compute the loss during training of all the baselines; the negative samples are generated following [13]. ConvKB is trained with 100 filters of size \(1\times 3\) initialized from a truncated normal distribution. The baseline models are mainly designed for entity prediction (\(?-r-o \text { or } s-r-?\)) for a given entity-relation pair (\(s, r \text { or } r, o\)); we redesign their triplet scoring functions to accomplish the relation prediction task (\(s-?-o\)). We use PyTorch [28] to implement TSAE and PyKEEN [1] to obtain the results of the baseline models with their basic settings.

Table 3 Complex relation categorization of WN18RR data set (M represents Many, values in bracket represent relation frequency in the test set)

5 Results and discussions

5.1 Model evaluation

This section presents the experimental results of TSAE and the baselines in terms of MRR and Hit@N. Underlined values indicate the second-best results, and bold values indicate the best results. Table 2 reports the results of all models on the three data sets. Translational models operate on triplet-level proximity and do not consider the graph structure, which results in unsatisfactory performance on all three data sets. Tensor factorization models generate a relation-specific tensor in which each 2D slice captures the relation-specific graph structure; thus, both tensor factorization models outperform the translational models and come close to the convolutional models. Among them, ComplEx performs better on WN18RR and YAGO3-10.

ConvE and ConvKB concatenate the relation and entity pair features in the convolution layers and then mimic the translation model at the dense layer through activation functions. ConvE performs well on FB15k-237 but relatively poorly on the lower node degree data sets WN18RR and YAGO3-10. The attention-based models DMACM and GGAE perform better than the translational, factorization-based, and convolutional models; in particular, GGAE scores the highest Hit@5 on WN18RR. However, TSAE captures triplet subgraph-level proximity and shows superior performance across all data sets. Specifically, it achieves more than 2% relative improvement in MRR and Hit@1 on WN18RR. Likewise, TSAE achieves a 2% relative improvement in MRR and Hit@1 on FB15k-237 and a 2.1% relative improvement in Hit@1 on YAGO3-10.

We tested the performance of TSAE on the triplets belonging to the complex relation categories present in the KG, as suggested by [2]. Complex relation categories are determined from the average number of entities associated with each relation on the subject (s) and/or object (o) side. Table 3 shows the complex relation categorization of WN18RR, which contains 11 relation types. Table 4 presents the category-wise average relation prediction results in Hit@1, in which TSAE achieves superior performance, including 2%, 2.1%, and 2.8% relative improvements in the 1-to-1, Many-to-1, and Many-to-Many relation categories over the baseline GGAE, respectively. In contrast, GGAE outperformed TSAE in the 1-to-Many relation category.
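A hedged sketch of this categorization: for each relation, we compute the average number of subjects per object and objects per subject and label each side ‘1’ or ‘M’ using the usual 1.5 threshold from [2]; the function names and the threshold parameter are assumptions for illustration.

```python
# Illustrative complex-relation categorization (1-to-1, 1-to-M, M-to-1, M-to-M).
from collections import defaultdict

def categorize_relations(triplets, threshold=1.5):
    heads, tails = defaultdict(set), defaultdict(set)
    for s, r, o in triplets:
        heads[(r, o)].add(s)            # subjects seen for each (relation, object)
        tails[(r, s)].add(o)            # objects seen for each (relation, subject)
    categories = {}
    for r in {r for _, r, _ in triplets}:
        avg_heads = _avg(heads, r)      # average number of subjects per object
        avg_tails = _avg(tails, r)      # average number of objects per subject
        left = '1' if avg_heads <= threshold else 'M'
        right = '1' if avg_tails <= threshold else 'M'
        categories[r] = f'{left}-to-{right}'
    return categories

def _avg(groups, r):
    sizes = [len(v) for (rel, _), v in groups.items() if rel == r]
    return sum(sizes) / len(sizes)
```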

Fig. 4 Heatmap of actual versus predicted relations for each triplet from the WN18RR test set

The WN18RR test set consists of 3134 triplets, with the relation frequencies stated in Table 3. Relation prediction Algorithm 2 returns the true relation set (TR) and predicted relation set (PR), which are used to generate the confusion matrix. Figure 4 illustrates the relation prediction heatmap for the test set triplets. The diagonal elements of the heatmap indicate correctly predicted relations, where TSAE achieves correct relation type prediction ranging from 75% (129 out of 172) for “\(r_4\_has\_part\)” to 100% (3 out of 3) for “\(r_1\_similar\_to\)”. The incorrect relation type predictions belong mainly to the same complex relation categories, as shown in Fig. 4.

Table 4 Complex relation category-wise results of relation prediction in Hit@1 on WN18RR data set
Fig. 5 Convergence of TSAE and baselines during training on the WN18RR dataset, recorded for (a) loss and (b) MRR

We recorded the MRR and the loss every five epochs for all models on the validation set and saved the best model according to the MRR criterion, which was then used to evaluate the model on the test set. Figure 5a shows the model loss versus epochs. TSAE achieves its best outcome within 600 epochs, whereas the other models take more epochs to update the entity features for their best results. TSAE reaches an MRR of up to 0.85 within the first 355 epochs, as shown in Fig. 5b, and then saturates with only minor performance changes. The attention mechanism improves relation prediction performance by providing additional support for entity feature learning.

Fig. 6 Effect of the feature vector dimension d on TSAE performance in (a) MRR and (b) Hit@1

5.2 Model analysis

This part presents the impact of the model design parameters on TSAE performance, with Hit@1 as the primary metric for relation prediction. TSAE uses the summation operation in the aggregation function to update the entity features, as given in (8). We also conducted experiments with the average and max aggregation functions, given in (13) and (14), respectively, to investigate their significance. The results of these aggregation functions on the three data sets are shown in Table 5. TSAE (avg) and TSAE (sum) perform consistently on the WN18RR dataset, and a similar performance can be seen on FB15k-237. Likewise, TSAE (avg) and TSAE (max) perform similarly on YAGO3-10, but TSAE (sum) achieves a 1% relative improvement.

$$\begin{aligned} \varvec{e}'_i = \sigma \Big (\frac{1}{|\mathcal {N}_i|} \sum _{j \in \mathcal {N}_i} \alpha _{ij}e^*_j \Big )\end{aligned}$$
(13)
$$\begin{aligned} \varvec{e}'_i = \sigma \Big (max( \alpha _{ij}e^*_j \mid j \in \mathcal {N}_i)\Big ) \end{aligned}$$
(14)
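For reference, the three aggregation variants of Eqs. (8), (13), and (14) differ only in how the attention-weighted neighbor features are pooled; the sketch below spells this out, reusing the notation of the earlier attention sketch (σ again taken as sigmoid).

```python
# Sketch of the sum / average / max aggregation alternatives compared in Table 5.
import torch

def aggregate(alpha, E_star_nbrs, mode='sum'):
    """alpha: |N_i| attention weights; E_star_nbrs: |N_i| x d' mapped neighbor features."""
    weighted = alpha.unsqueeze(1) * E_star_nbrs
    if mode == 'sum':                        # Eq. (8)
        pooled = weighted.sum(dim=0)
    elif mode == 'avg':                      # Eq. (13)
        pooled = weighted.mean(dim=0)
    else:                                    # Eq. (14): element-wise max over neighbors
        pooled = weighted.max(dim=0).values
    return torch.sigmoid(pooled)
```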
Table 5 Effect of aggregation function on TSAE in Hit@1

The feature vector dimension plays a vital role in KG embedding. To analyze the effect of the vector dimension, we performed experiments with dimensions \(d \in [50, 75, 100, 150, 200, 250]\) on the three data sets. Figure 6 shows MRR and Hit@1 bar charts demonstrating the effect of the vector size on TSAE. The performance of TSAE in MRR and Hit@1 improves with increasing dimension size, and TSAE performs satisfactorily at \(d = 250\) on all data sets. Due to resource limitations, we experimented with dimensions up to 250; performance may improve further for larger sizes.

The transformation matrix \(W_T \in R^{d\times d^\prime }\) provides TSAE with sufficient expressive power. To investigate the importance of the transformation matrix for TSAE performance, we conducted experiments with different matrix dimensions. Taking the previous investigation into account, we fixed \(d = 250\) and chose \(d' \in [300, 400, 500, 600, 700, 800]\). The results are plotted in Fig. 7, which shows the effect of \(d^\prime \) on Hit@1 over the three data sets. TSAE achieves superior performance over the baseline up to 500 dimensions; performance then saturates at values from 600 to 700 and improves slightly at 800.

Fig. 7 Effect of the transformation matrix dimension \(250\times d'\) on TSAE performance in Hit@1

The dense layer computes the relation likelihood for the given triplet entity pair. The number of hidden layers gradually affects the prediction performance of the learned model. We therefore perform experiments with \(H_i \in [1, 2, 3, 4]\) for a fixed input layer \(size(H_{inp}) = 1000\) (the concatenation of \(\varvec{e}'_1,\varvec{e}'_2 \in \mathbb {R}^{500}\)) and output layer \(size(H_{out}) = |R|\) (the number of existing relations). Table 6 shows the sensitivity of TSAE to the number of hidden layers, whose sizes are calculated using (9). TSAE with two hidden layers gives superior results on WN18RR, after which Hit@1 saturates with only marginal improvement up to four layers. TSAE performs decently with two hidden layers but achieves 2% and 3% relative improvements with three hidden layers on the FB15k-237 and YAGO3-10 data sets, respectively.

Table 6 Hidden layer size effects on the TSAE performance in Hit@1 on WN18RR data set
Table 7 Case study on relation prediction in Hit@5 on YAGO3-10 data set

5.3 Case study

To demonstrate the effectiveness of TSAE for knowledge graph completion, we present a case study on the YAGO3-10 test set. Table 7 shows the relation prediction results in Hit@5 for six entity pairs from the test set. The TSAE results for each entity pair are arranged in descending order of the predicted score, and bold indicates the ground-truth relation type between the given entity pair. TSAE predicts the true relation type isConnectedTo at the first rank (highest score) for the first entity pair, and similar results can be observed for the next three entity pairs. However, TSAE predicts lower scores for the true relations of the fifth and sixth entity pairs because entities such as Moscow and Vinay_Pathak occur less frequently in the training set.

6 Conclusion

In this paper, we propose TSAE, a novel embedding model for enhancing missing fact inference in knowledge graphs. TSAE combines the simplest form of an attention mechanism with a neural network. The attention layer computes the attention values between the entity pair and their neighbors and then updates the triplet entity features by aggregating the neighborhood attention features. The neural network concatenates and processes the attention-aware triplet entity features to compute the likelihood of the relation types. Experimental results show that TSAE performs better than the baselines on standard metrics. Furthermore, we investigate the model’s sensitivity to the training parameters, and a case study illustrates the effectiveness and efficiency of TSAE in completing knowledge graphs. TSAE is practical for applications where efficiency, simplicity, and interpretability are important.

KGs are often sparse, and TSAE might struggle with less frequent (long-tail) entities because they tend to receive less attention than frequent entities. This could be investigated by incorporating adaptive attention mechanisms that give more weight to less frequent entities. TSAE can also be prone to overfitting, especially when dealing with noisy or corrupted KGs; future research can focus on regularization techniques, such as dropout or weight decay, to mitigate overfitting in attention-based models. Finally, many real-world KGs are dynamic, and their information changes over time, which TSAE does not naturally handle. It is necessary to explore ways to incorporate temporal information by combining attention mechanisms with other techniques, such as adversarial learning, reinforcement learning, and transfer learning.