1 Introduction

Along with the development of the Internet, we are witnessing tremendous growth of multidisciplinary information resources in the form of knowledge graphs (KGs) [1, 9, 17, 32], such as Wikipedia, WordNet, YAGO, Freebase and DBpedia. In fact, KGs play an important role as indispensable auxiliary expert knowledge sources for constructing AI-based systems [4, 8]. A KG is a multi-relational graph [5] composed of a large number of facts which can be denoted as triples of the form <head, relation, tail>, e.g., <Donald_Trump, President_Of, USA>, <Elon_Musk, Founder_Of, SpaceX>. In recent years, many organizations and researchers have concentrated on how to extract latent features from given KGs by preserving and learning their structures. This approach is called KG embedding or KG representation learning. KG embedding [33, 37] has quickly gained massive attention due to its wide applications in different domains, such as information retrieval [11], QA chatbots [2, 4] and recommendation [34]. In general, KG embedding compresses the high-dimensional, complex structure of entities and their associated relations in a KG into fixed low-dimensional, continuous vectors. These vectors, which represent entities and their relations, are then used for multiple tasks, such as similar entity search, entity clustering/classification and relation extraction. Moreover, KG embedding can also be used to infer new facts which do not yet exist in the current KG, such as predicting relations/links between unconnected entities or predicting (head/tail) entities for given relations, a task commonly called KG completion.

Searching/querying for relevant entities [1, 3, 18] is considered a primitive task underlying most common applications of KG embedding. Recently, similarity search in KGs has encountered many challenges due to the complexity and ambiguity of users’ queries. For example, in a realistic application such as a QA chatbot [4], we usually encounter complex similar-entity queries such as “Which are places in Paris that are similar to the Louvre museum?” or “What are similar places to Vọng_Cảnh hill in Huế?” The reasonable outputs for this type of query are more complicated than just finding the top-k embedded entities closest to the “place/museum:Louvre” and “place/hill:Vọng_Cảnh” entities in a given KG. In fact, multiple searching criteria must be fulfilled before returning the results to the end-users, e.g., for the first query, the top returned results must be place-/museum-typed entities within the city of Paris, France. A common technique for solving these complex searching tasks is to model the given queries as meta-path-based patterns. A meta-path is a symmetric sequential order of entities and relations which indicates a specific semantic meaning of the interconnections between a KG’s entities. Returning to the example of finding entities similar to “Louvre,” we can formulate this query as a meta-path, [place/museum]\(\mathop \to \limits^{{{\text{containedInPlace}}}}\)[city]\(\mathop \leftarrow \limits^{{{\text{containedInPlace}}}}\) [place/museum] (as shown in Fig. 1a). Similarly, the second query can also be modelled as a meta-path [14]: [place]\(\mathop \to \limits^{{{\text{containedInPlace}}}}\)[city]\(\mathop \leftarrow \limits^{{{\text{containedInPlace}}}}\) [place] (Fig. 1b).

Fig. 1 Illustrations of modelling user’s queries as meta-path-based patterns

However, the second query is more complicated than the previous one due to the ambiguity of the user’s multiple searching purposes. “Vọng_Cảnh” is a hill which is a common place for tourists to enjoy sightseeing, so entities similar to “Vọng_Cảnh” hill must be places which are suitable for sightseeing. In this example, “Ngự_Bình” mountain is considered a top candidate answer for a query such as “Which place is similar to Vọng_Cảnh?” The best answers for this type of query must satisfy two aspects: the place must be in “Huế” (the same city as “Vọng_Cảnh”) and should be a mountain or a hill (“Ngự_Bình,” a mountain in Huế, satisfies both). Therefore, a thorough evaluation of an entity’s concept/description is also necessary when computing the similarity between two entities in KGs.

1.1 Problem definition

Definition 1

A Knowledge Graph (KG), viewed as a Heterogeneous Information Network (HIN), is a directed labelled graph, denoted as \(G = (V, E, \phi, \psi)\), where

  • \(V\) stands for the set of entities/nodes in the given KG.

  • \(E\) stands for the set of relations/links between entities/nodes in the given KG. These relations might be binary (1 if the relation exists, 0 otherwise) or weighted.

  • \(\phi\) and \(\psi\) are two mapping functions, where

    1. Node type mapping function: \(\phi: V \mapsto {\mathcal{A}}\) — with \(V = \left\{ {v_{1} ,v_{2} , \ldots ,v_{n} } \right\}\), a specific node \(v\) belongs to a specific type \(a \in {\mathcal{A}}\), i.e., \(\phi \left( v \right) = a\).

    2. Edge type mapping function: \(\psi: E \mapsto {\mathcal{R}}\) — with \(E = \left\{ {e_{1} ,e_{2} , \ldots ,e_{m} } \right\}\), a specific edge \(e\) belongs to a specific type \(r \in {\mathcal{R}}\), i.e., \(\psi \left( e \right) = r\).

  • Traditionally, a knowledge graph is represented as a set of triples, denoted as \(\langle h, p, t \rangle\), where \(h\), \(t\) and \(p\) denote the head object, the tail object and the predicate/link, respectively. An RDF triple is considered a direct relation between two entities, e.g., h: Hà_Nội \(\xrightarrow{{{\text{r: capital}}\_{\text{of}}}}\) t: Việt_Nam. In previous studies, a KG (also known as an ontology) is specified as an RDF (Resource Description Framework) graph which contains RDF triples \(\left\{ \langle h, p, t \rangle \right\}\). These RDF triples are used to model direct relationships between entities in a given KG. However, this traditional representation is unable to sufficiently model real-world KGs as heterogeneous networks which contain multi-typed entities and relations.

Definition 2

Network Representation Learning (NRL) [15]: given an information network, denoted as \(G = (V, E)\), where \(V\) and \(E\) denote the sets of the network’s nodes and edges, respectively, the ultimate goal of an NRL model is to find a mapping function, denoted as \(f\), which transforms the given set of network nodes into d-dimensional vectors: \(f: V \to \mathbb{R}^{{\left| V \right| \times d}}\).

Definition 3

Similarity search via the NRL approach: depending on a specific similarity search purpose, an NRL model is defined to capture specific relevant information between the network’s nodes and to map these features into a shared vector space. Therefore, similar nodes with common distinctive features are represented by similar vectors. To meet a specific relevant-node search task, the mapping function (\(f\)) is designed accordingly to capture the desired latent features of the given network’s nodes.
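To make Definition 3 concrete, the following minimal sketch (our illustration, not part of the original definition) ranks same-typed entities by cosine similarity over already-learned embedding vectors; the dictionary layout and the entity names are assumptions of the example only:

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def top_k_similar(query, embeddings, entity_type, k=10):
    """Rank same-typed entities by cosine similarity to the query entity.

    `embeddings` maps entity name -> (type, vector); the layout is a
    hypothetical convenience for this sketch, not the paper's data format."""
    _, q_vec = embeddings[query]
    candidates = [
        (name, cosine(q_vec, vec))
        for name, (t, vec) in embeddings.items()
        if name != query and t == entity_type
    ]
    return sorted(candidates, key=lambda x: x[1], reverse=True)[:k]
```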

Similarity search is considered the most common problem in knowledge graph (KG) mining [14, 18]. Similarity search in a KG helps to find the entities most relevant to a given user query. To solve the similarity search task in KGs, knowledge graph embedding, a subarea of network representation learning (NRL) (Definition 2), is currently the best-known technique for preserving and representing the structure of a KG (entities and relations) as low-dimensional vectors [2, 8, 9, 17]. Then, we can simply measure the relevance between entities by calculating the distance between their vectors. This approach is called similarity search via the NRL approach (Definition 3). In the past, most KG embedding models followed a homogeneous embedding approach which considers all of a KG’s entities and relations to be of the same type. In practice, however, users’ queries are ambiguous and are represented as sequential relations within the complex structure of a KG, which is itself a heterogeneous information network (HIN) [5] (Definition 1) with multi-typed entities and relations. To effectively learn the representations of entities and their associated relations in a given KG, many embedding methods have been proposed recently. The most common approach for KG embedding is the translational distance approach with the well-known Trans-family models (TransE [33], TransH [37], TransR [11]). The famous translation-based TransE [5] model aims to embed the entities and relations of a KG into the same fixed \(d\)-dimensional continuous latent space, denoted as \(\mathbb{R}^{{\left| V \right| \times d}}\), where \(\left| V \right|\) is the number of entities in the given KG. In the translation-based approach, TransE is designed to exploit the translation from the head entity (denoted as a vector \(\varvec{h}\)) to the tail entity (\(\varvec{t}\)) with respect to their associated relation (\(\varvec{r}\)) within a specific fact. The model is trained to achieve the objective \(\varvec{h} + \varvec{r} \approx \varvec{t}\) (as illustrated in Fig. 2a). As a further improvement for handling multiple relations between the same head/tail entities, TransH proposes a relation-specific hyper-plane projection mechanism for differentiating the roles of the same entities under different relations in given facts (as illustrated in Fig. 2b). Similarly, the TransR model employs relation-specific spaces (instead of the hyper-plane-based projection of TransH) to separate the same head/tail entities in different facts according to their corresponding relation. However, translational distance embedding techniques only focus on the direct relations/triples (which occur in facts) between entities rather than on paths. Therefore, these translational distance KG embedding techniques are unable to handle complex querying tasks which require the evaluation of indirect interconnections between entities. Table 1 presents the common notations used in this paper.
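As a brief illustration of the translational objective \(\varvec{h} + \varvec{r} \approx \varvec{t}\) described above, the sketch below shows a TransE-style scoring function and margin-based ranking loss; it is a simplified reading of the published method, and the triple and embedding containers are our own assumptions:

```python
import numpy as np

def transe_score(h, r, t, norm=1):
    """Translational distance ||h + r - t||; lower means a more plausible triple."""
    return np.linalg.norm(h + r - t, ord=norm)

def margin_loss(pos_triple, neg_triple, embeddings, margin=1.0):
    """Margin-based ranking loss over one positive and one corrupted triple.

    `embeddings` maps an entity/relation name to its vector; the (head,
    relation, tail) format mirrors the <h, r, t> notation in the text."""
    h, r, t = (embeddings[x] for x in pos_triple)
    h_n, r_n, t_n = (embeddings[x] for x in neg_triple)
    return max(0.0, margin + transe_score(h, r, t) - transe_score(h_n, r_n, t_n))
```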

Fig. 2 Illustrations of TransE and TransH KG embedding models

Table 1 List of notations which are used in this paper

1.2 Existing challenges and motivations

In realistic knowledge extraction from a KG, using only direct relations/triples to learn the representations of entities is insufficient. Because the evaluation of paths/sequential relations between entities in the KG is ignored, the representation output cannot be used for complex querying tasks. Recently, multiple studies have focused on exploiting the sequential relations/paths between entities to improve knowledge representation outputs, such as PTransE [11] and RPE [2]. These models consider path-specific evaluation while learning the representations of entities, which enables them to solve complex querying tasks in KGs. However, the models proposed by Lin et al. (PTransE and RPE) still pay little attention to the sequential order as well as to the relation types within the paths between entities in KGs. In fact, different paths between entities might carry different semantic meanings. For example, different paths between the two entities “France” and “\({\text{Eiffel}}\_{\text{Tower}}\)” carry different meanings, such as France \(\mathop \to \limits^{{{\text{Contain}}}}\) Paris \(\mathop \to \limits^{{{\text{Contain}}}}\) Eiffel_Tower and France \(\mathop \to \limits^{{{\text{Capital}}}}\) Paris \(\mathop \to \limits^{{{\text{Has}}}}\) Eiffel_Tower. Therefore, different semantic paths between same-typed entities should be embedded as different vectors.

Moreover, most traditional KG embedding techniques focus only on the structural information of the knowledge graph (relations between entities) and ignore the textual information which is tightly associated with entities. In fact, the plain text associated with entities in a KG provides abundant valuable information as well as support for entity and relation disambiguation while learning the representation of the given KG. It is undeniable that textual data can serve as a supplement for improving the knowledge graph embedding task in both its structural and contextual aspects. Recently, the joint learning of textual information and structural representations in KGs has attracted a lot of interest from researchers, with multiple proposals [15, 32]. Recent studies [18, 35] have focused on combining textual information with the structural information of a KG to improve the representation outputs. However, joint text-based KG embedding models still lack a thorough evaluation of the sequential relations between entities in a KG.

1.3 Our contributions

To fully combine textual information and a KG’s structure in the representation learning task, in this paper we propose a novel text-enhanced meta-path-based embedding model, called W-KG2Vec. To properly capture the rich semantic structure of a given KG, we apply a meta-path-based random walk mechanism to generate contextual entities for each given entity via different defined meta-paths, inspired by our previous works [24,25,26,27]. Our principal assumption in applying meta-path-guided representation learning in a KG is that same-typed entities which are interconnected via defined specific paths must be transformed into similar vectors in the KG embedding space (as illustrated in Fig. 3a). Moreover, the random walk is guided by transitional weights derived from the text-based similarity between entities. The textual similarity measures between entities are identified by applying the collaborative self-attention of the pre-trained BERT model [15] together with sequential encoding to effectively learn the representation of textual data.

Fig. 3 Illustrations of KG embedding strategies of proposed W-KG2Vec model

In this paper, we apply a pre-trained BERT model with a bidirectional LSTM encoder to obtain embeddings of the textual descriptions of entities in the given KGs. These description representations are then used to compute the text-based similarity between entities (as illustrated in Fig. 3b), and the computed similarity scores in turn guide the meta-path-based random walk mechanism. The joint representation learning of both textual information and the KG’s structure via meta-path-based random walks promises to improve the quality of the KG representation learning output. The main difference between our proposed model and other KG embedding models is its capability of capturing both the semantic and the local structural latent features of entities in the given KG to effectively fulfill the similarity search task. To sum up, our main contributions in this paper can be summarized as follows:

  • The introduction of a novel combination of the pre-trained BERT model with a Bi-LSTM encoder, called BERT-Text2Vec, to learn the sequential representation of the textual descriptions associated with entities in given KGs.

  • The application of a meta-path-based random walk mechanism in the proposed W-KG2Vec model to generate contextual entities for each target entity in the KG via defined meta-paths. Meta-path-based walks on the KG are guided by the textual similarity weights between entities, which are calculated by BERT-Text2Vec. The extracted contextual entities are then used to train the KG representation learning model.

  • Extensive experiments on benchmark datasets with complex similar-entity searching tasks demonstrate the effectiveness of our proposed model in comparison with recent state-of-the-art baselines.

In Fig. 4, we present the overall architecture of our proposed W-KG2Vec model. The rest of this paper has four main sections. In the second section, we review related work and discuss the advantages/disadvantages of recent KG embedding techniques. In the third section, we introduce the BERT-Text2Vec and W-KG2Vec models for the text-enhanced meta-path-based KG embedding approach. Next, in Sect. 4, we present extensive experiments and comparative studies on the performance of the proposed W-KG2Vec model against recent KG embedding techniques. Finally, we conclude our work and present future improvements in Sect. 5.

Fig. 4 Overall architecture of our proposed W-KG2Vec model for KG embedding task

2 Related works and motivations

In recent years, the use of KGs for supporting AI-based systems has grown quickly. KG embedding has been proved to benefit multiple tasks such as information retrieval, question answering and relation extraction in different knowledge domains. KG embedding is designed to transform multi-typed connected data, in the form of entities and their relations, into a continuous, fixed low-dimensional vector space. Popular KG embedding techniques can be categorized into two main groups, as follows.

2.1 Translational distance KG embedding approach

In the translational distance approach, the proposed embedding techniques mainly depend on the structural information of the KG, specifically the directed relationships between entities in the form of triples \(\left\langle {h,\,r,\,t} \right\rangle\). The most traditional and well-known KG embedding technique is the TransE [1] model. TransE is a simple and effective method that learns the vector representations of both the entities and the relations of a given KG. TransE is based on the basic idea that a relation between head and tail entities corresponds to a distance translation between the representations of the two entities, denoted as \(\varvec{h} + \varvec{r} \approx \varvec{t}\). The Unstructured Model (UM) [3] is an earlier version of TransE which eliminates the relations between entities when training the embedding model, while Structured Embedding (SE) [5] applies matrix projections to differentiate relations between the same pairs of entities in KGs. However, the TransE model is only capable of representing 1-to-1 relations between target entities, which leads to failures in translating 1-to-N, N-to-1 and N-to-N relations. Therefore, several improvements such as TransH, TransR and TransA have been proposed to overcome this problem by projecting relations into different hyper-planes/subspaces. Besides the Trans-family models, there are some proposals which also belong to the translation-based KG embedding approach, such as Gaussian-based KG embedding techniques (KG2E [13], TransG [38]), which mainly depend on multivariate Gaussian distributions for learning the representations of entities and relations. Similar to the approach of the SE model, RESCAL [22] is a bilinear model which represents relations between entities in a KG as matrices. However, most translational distance KG embedding techniques are less informative because they ignore textual data, such as the descriptions and concepts of entities, while learning the representations. Textual information associated with a KG’s entities is now taken into consideration while training KG embedding models in order to improve the output quality of entities and relations. This text-enhanced KG embedding trend has led to several improved models recently. The recently proposed ConvE [31] performs a global 2-D convolution operation on the subject entity and relation embedding vectors; these embedding vectors are reshaped as matrices and then concatenated. Similarly, RotatE [39] extends the KG embedding model to multiple relations/entities by defining each relation as a rotation in a complex embedding space. However, these recent approaches are all link/triple-based embedding approaches and do not consider the text-based semantic similarity between entities.

2.2 Jointly text-enhanced KG embedding approach

In recent times, researchers have shown much interest in jointly learning representations of both the structural and the textual information of KGs. Multiple proposed techniques apply embedded textual data to improve the quality of entity representations in KGs, such as using the average word representation [16, 31] of entities’ names to identify similar entities while training the KG embedding model. Inspired by previous work [31] on jointly text-enhanced KG embedding, later improvements [27, 39] proposed extended embedding approaches for learning the representation of entities’ textual descriptions, which helps to enrich the embedding quality of entities in a KG. In fact, in earlier work, the textual information and the KG’s structure embedding are learnt separately with different objective functions, which leads to sparsity in the KG’s entity representations. To overcome the separation between structure-based and text-based representations in KG embedding, a convolutional neural network (CNN) architecture which utilizes both structure-based and text-based embedding aspects was proposed in [24], referred to as J-CNN for short. Recently, with the success of the auto-encoding approach in natural language processing, such as GPT [36] and BERT [27], several studies have adopted these techniques to improve the performance of the KG embedding task in rich-textual contexts, such as KG-BERT [42] and K-BERT [41]. Through the success of previous jointly text-enhanced models in improving the overall output of the KG embedding task, textual information has proved to be an effective way of improving the quality of entity representations. However, the textual data, mainly the descriptions of entities in KGs, exist in the form of long-text documents; separately learning the embedding of each word in these documents (the continuous bag-of-words approach) might cause information loss due to the sequential complexity of textual data composed in natural language.

3 Methodology

In this section, we introduce our proposed W-KG2Vec model for text-enhanced knowledge graph embedding, which jointly learns from the textual descriptions of entities in KGs and applies meta-path-based random walks to generate the contextual entities of each source entity following defined semantic sequential relations in the form of meta-paths.

3.1 Preliminaries and definitions

Formulating a KG as a heterogeneous information network (HIN) (Definition 1), we denote by \({\mathcal{A}}\) and \({\mathcal{R}}\) the sets of entity/node types and relation/link types, respectively. In the context of HINs, a knowledge graph is a directed labelled graph where \(V\) is a set of multi-typed entities connected by a set of multi-typed relations, denoted as \(E\).

A KG is considered rich-semantic if it has a large number of entity types and relation types; for example, YAGO and Freebase contain thousands of entity and relation types. In order to have an overview of the complexity of a KG, we need to look at its network schema, KGNS (Definition 4). In fact, most real-world KGs such as YAGO, Freebase and DBpedia have complex structures, with the numbers of entity types and relation types possibly reaching thousands. These KGs are considered rich-schematic KGs with complicated KGNSs. For rich-schematic KGs, the KGNS is necessary for understanding the possible direct relation types between entity types as well as for defining semantic paths between pairs of entities. With the KGNS of a given KG, we can easily identify the set of direct relation types that can occur between two entity types as well as their interconnected paths/sequences of relations. In a KG, two entities might be connected not only by direct relations but also via indirect sequential relations which carry rich semantic meanings.

Definition 4

Knowledge Graph Network Schema (KGNS): for a given KG, denoted as \(G = (V, E, \phi, \psi)\), a KGNS is formally defined as a tuple \(({\mathcal{A}}_{G} ,{\mathcal{R}}_{G} ,{\mathcal{E}}_{G} ,{\mathcal{P}}_{G})\), where \({\mathcal{A}}_{G} = \bigcup \nolimits_{{v \in V}} \phi \left( v \right)\) and \({\mathcal{R}}_{G}=\bigcup\nolimits_{e \in E}\psi (e)\) are the entity types and relation types which appear in the given KG (\(G\)). \({\mathcal{E}}_{G}\) and \({\mathcal{P}}_{G}\) denote the sets of direct relations and of indirect relations in the form of meta-paths (Definition 5) between entities in the given KG (\(G\)).

Definition 5

Meta-path \(({\mathbf{\mathcal{P}}})\) [30]: a meta-path is defined as a sequence of relations between entity types; normally a meta-path is symmetric, with the same source and target entity type. A meta-path of length (\(l\)) is defined in the form \({\mathcal{P}} = {\mathcal{A}}_{1} \mathop \to \limits^{{{\mathcal{R}}_{1} }} {\mathcal{A}}_{2} \mathop \to \limits^{{{\mathcal{R}}_{2} }} \ldots \mathop \to \limits^{{{\mathcal{R}}_{l} }} {\mathcal{A}}_{{l + 1}}\), where \({\mathcal{A}}_{1} ,{\mathcal{A}}_{2}, \ldots ,{\mathcal{A}}_{{l + 1}} \in {\mathcal{A}}\) and \({\mathcal{R}}_{1} ,{\mathcal{R}}_{2}, \ldots ,{\mathcal{R}}_{l} \in {\mathcal{R}}\) are the entity types and relation types occurring in the given meta-path \({\mathcal{P}}\), respectively.

In a KG, indirect sequential relations/paths between entities are written as \(v_{1} \mathop \to \limits^{{e_{1} }} v_{2} \ldots \mathop \to \limits^{{e_{l} }} v_{{l + 1}}\). These indirect sequential paths connecting two entities can be formulated as “meta-paths” (Definition 5). Besides carrying rich semantic meanings of the relations between entities, the number of possible meta-paths also corresponds to distinctive features. In other words, the existing meta-paths reflect the real-world structural complexity of a given KG. For some rich-schematic KGs, such as YAGO and Freebase, the number of meta-paths which can possibly be defined is much larger than in simple-schematic KGs, such as DBLP and MovieLens. Between two same-typed entities, we might have multiple meta-paths of different lengths. In heterogeneous network analysis and mining tasks, such as the similarity search task, the meta-paths between entities are mostly defined by users in order to achieve different outputs. In other words, meta-paths are patterns that depend on the querying purposes of the users. By applying user-specified meta-paths in KG embedding, we can flexibly tailor the entity representation outputs to the needs of the retrieval task. In our W-KG2Vec model, instead of using all direct relations between entities in all of the KG’s triples, the user-specified meta-paths are used to train the KG’s entity representation model.
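To illustrate how a user-specified meta-path (Definition 5) might be encoded for later walk generation, the following hypothetical sketch stores a meta-path as its alternating entity-type and relation-type sequences; the type names and the “_inv” convention for reverse edges are assumptions of this example, not notation from the paper:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class MetaPath:
    """A meta-path A1 -R1-> A2 -R2-> ... -Rl-> A(l+1)."""
    entity_types: List[str]    # length l + 1
    relation_types: List[str]  # length l

    def steps(self) -> List[Tuple[str, str, str]]:
        # Expand into (source type, relation type, target type) hops.
        return [
            (self.entity_types[i], self.relation_types[i], self.entity_types[i + 1])
            for i in range(len(self.relation_types))
        ]

# Example pattern from Fig. 1: [place] -containedInPlace-> [city] <-containedInPlace- [place]
# (the reverse direction is written here as an "_inv" relation for simplicity).
place_city_place = MetaPath(
    entity_types=["place", "city", "place"],
    relation_types=["containedInPlace", "containedInPlace_inv"],
)
```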

In this paper, we propose a meta-path-based KG embedding technique which contains two modules. The first module is in charge of learning the representations of entities’ textual descriptions by applying the collaborative self-attention of BERT to learn sentence-level representations and then combining them with an LSTM encoder to produce the final representations of the given entities’ textual descriptions. This module is called BERT-Text2Vec. Next, for each entity in the given KG, we use the meta-path-based random walk mechanism to generate a set of contextual entities which will be used in the embedding model training process. The random walks from each entity are controlled by the calculated similarity weights between that entity and its neighbors. Finally, the representation model is optimized by applying heterogeneous negative sampling with SGD.

3.2 BERT-Text2Vec: sequential textual data representation learning approach

The main goal of the proposed BERT-Text2Vec module is to learn representations of the textual descriptions of entities in a given KG. The textual description of each entity can provide supplementary information for entity concept disambiguation (e.g., “\({\text{JFK}}_{\rm AirPort}\)” vs. “\({\text{JFK}}_{\rm Person}\),” “\({\text{blackberry}}_{\rm company}\)” vs. “\({\text{blackberry}}_{\rm fruit}\)”) as well as for text-based similarity evaluation (e.g., “\({\text{Bengal}}\_{\text{tiger}}_{\rm animal}\)” with “\({\text{Sumatran}}\_{\text{tiger}}_{\rm animal}\),” “\({\text{Paris}}_{{{\text{Location}}\_{\text{City}}}}\)” with “\({\text{Lyon}}_{{{\text{Location}}\_{\text{City}}}}\),” etc.). In fact, the textual descriptions of entities in KGs take the form of long-text documents with multiple long sentences. Unfortunately, current pre-trained BERT models have not yet been fine-tuned for long-text documents (longer than 512 words/tokens), so we propose a new approach for learning the representation of the textual descriptions of entities in KGs, called BERT-Text2Vec. BERT-Text2Vec is a combination of a pre-trained BERT model with a bidirectional LSTM encoder to fulfill the long-text representation learning task.

3.2.1 Sentence representation learning with BERT pre-trained model

At first, we split each textual description (\(d\)) of an entity in the KG into multiple (\(n\)) sentences, denoted as \(d = \left\{ {s_{1} ,s_{2} , \ldots ,s_{n} } \right\}\). Then, we apply the pre-trained BERT model to learn the representation of each word/token in a given sentence. Assume that a sentence contains a list of (\(m\)) tokenized words (\(w\)), with \(m < 512\), denoted as \(s = \left\{ {w_{1} ,w_{2} , \ldots ,w_{m} } \right\}\). We apply the pre-trained BERT (\({\text{BERT}}_{\rm base}\)) model to learn the representation of each word in each sentence. The original pre-trained BERT base model contains 12 hidden layers with 768 hidden units; we use these output hidden units as the embedding vectors for the words in each sentence. Each word is now represented as a 768-dimensional vector, denoted as \(\left\{ {\overrightarrow {{w_{1} }} ,\overrightarrow {{w_{2} }} , \ldots ,\overrightarrow {{w_{m} }} } \right\} \in \mathbb{R}^{{768}}\). Then, we apply a Bi-LSTM architecture with global average pooling to form the representation of the given sentence, denoted as \({{\vec{s}}}\). We construct the Bi-LSTM with different parameters (\(\theta _{\rm forward} ,\theta _{\rm backward}\)) to reflect the asymmetry of sentence processing. After that, the hidden states of the forward (\(\vec{h}\)) and backward (\(\overleftarrow{h}\)) passes are concatenated and global average pooling is applied to form the final representation of the given sentence. The overall sentence representation learning can be described as follows (see Eq. 1):

$$\begin{aligned} \vec{h} & = {\text{LSTM}}\left( {\overrightarrow {{w_{1} }} ,\overrightarrow {{w_{2} }} , \ldots ,\overrightarrow {{w_{m} }} \,|\,\theta _{\rm forward} } \right) \\ \overleftarrow{h} & = {\text{LSTM}}\left( {\overrightarrow {{w_{m} }} ,\overrightarrow {{w_{{m - 1}} }} , \ldots ,\overrightarrow {{w_{1} }} \,|\,\theta _{\rm backward} } \right) \\ \vec{s} & = {\text{AvgPool}}\left( {\left[ {\vec{h};\overleftarrow{h}} \right]} \right) \\ \end{aligned}$$
(1)

where

  • \(w\) is the 768-dimensional vector representing each word/token in a given sentence (\(s\)), obtained from BERT.

  • \(\vec{s}\) is the d-dimensional vector representing the given sentence, where \(d\) is the number of LSTM cells used.

  • \(\vec{h}\) and \(\overleftarrow{h}\) are the hidden states of the forward and backward passes, respectively.

Our objective in applying BERT to extract the word embeddings of each sentence and forming the sentence embedding with a Bi-LSTM is to capture the implicit discourse relations between the words in each sentence. Given (\(d_{s}\)), the initial size of the sentence embedding vector, which is also the number of LSTM cells in the forward and backward flows, the inputs of the Bi-LSTM model are the set of 768-dimensional embedded vectors representing the words in the given sentence (\(s\)).

Taking the 768-dimensional embedded vector of each word (\(\varvec{w}_{i}\)), the Bi-LSTM learns the sequential order of the words’ representations in both the forward and backward passes. Finally, we concatenate the forward and backward hidden states and apply \({\text{AvgPool}}\) to form the final |\(d_{s}\)|-dimensional representation of the given sentence (\(s\)). Through our experimental studies, the use of average pooling achieves better performance than other vector combination strategies such as max pooling or min pooling. Through careful evaluation of textual document representation via the Bi-LSTM encoder, we found that the output latent hidden vectors are quite synthetic, and average pooling is a suitable strategy for softly aligning and combining the latent representations of these two hidden state vectors into a final document representation vector. We present extra experiments comparing the effect of the max, min and average pooling strategies on the overall W-KG2Vec model performance in Sect. 4.3.2.
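The following sketch outlines the sentence-level part of BERT-Text2Vec (Eq. 1) under our own assumptions: it uses the HuggingFace transformers library with the bert-base-uncased checkpoint as a frozen feature extractor and an illustrative sentence dimension of 128; it is not the authors’ released implementation:

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class SentenceEncoder(nn.Module):
    """Sketch of the BERT + Bi-LSTM sentence encoder (Eq. 1)."""

    def __init__(self, sent_dim=128):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # Bi-LSTM over the 768-d BERT token vectors; each direction outputs
        # sent_dim // 2 units so the concatenation is sent_dim-dimensional.
        self.bilstm = nn.LSTM(768, sent_dim // 2, bidirectional=True, batch_first=True)

    def forward(self, input_ids, attention_mask):
        with torch.no_grad():  # BERT used as a frozen feature extractor here
            token_vecs = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.bilstm(token_vecs)   # [batch, m, sent_dim]
        return lstm_out.mean(dim=1)             # global average pooling -> sentence vector

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = SentenceEncoder(sent_dim=128)
batch = tokenizer(["The Louvre is a museum in Paris."], return_tensors="pt",
                  truncation=True, max_length=512)
sentence_vec = encoder(batch["input_ids"], batch["attention_mask"])  # shape [1, 128]
```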

The overall process of our sentence representation learning strategy is described in Fig. 5. Our approach to sentence representation learning using the sequential encoding mechanism of the Bi-LSTM is inspired by previous approaches [7, 23]. However, a major difference of our strategy compared with previous models is the application of the pre-trained BERT model for obtaining bidirectional word embeddings.

Fig. 5 Overall strategy of sentence representation learning of proposed BERT-Text2Vec model by applying BERT pre-trained model with Bi-LSTM encoder

3.2.2 LSTM encoder for long-text document model

In order to capture the sequential representation of the sentences in each KG entity’s description, denoted as \(\vec{d}\), we propose a technique of exchanging textual information between sentences through the state transitions of a recurrent neural network (RNN) architecture, resulting in a sequence of sentence states. From the sentence representations obtained in the previous step, we apply an LSTM encoder to learn the final representation of the given textual descriptions of the KG’s entities. Taking each sentence as one time-step input of the LSTM encoder, the gated state transition operation for the hidden state \(h_{j}\) of the (\(j\))-th sentence, denoted as \(s_{j}\), is defined as follows (see Eq. 2):

$$\begin{aligned} i_{j} & = \sigma \left( {W_{i} \overrightarrow {{s_{j} }} + U_{i} h_{{j - 1}} + b_{i} } \right) \\ o_{j} & = \sigma \left( {W_{o} \overrightarrow {{s_{j} }} + U_{o} h_{{j - 1}} + b_{o} } \right) \\ f_{j} & = \sigma \left( {W_{f} \overrightarrow {{s_{j} }} + U_{f} h_{{j - 1}} + b_{f} } \right) \\ u_{j} & = \tanh \left( {W_{u} \overrightarrow {{s_{j} }} + U_{u} h_{{j - 1}} + b_{u} } \right) \\ c_{j} & = i_{j} \odot u_{j} + f_{j} \odot c_{{j - 1}} \\ h_{j} & = o_{j} \odot \tanh \left( {c_{j} } \right) \\ \end{aligned}$$
(2)

where

  • \(i_{j}\), \(o_{j}\) and \(f_{j}\) denote the input, output and forget gates, respectively.

  • \(W\), \(U\) and \(b\) are model parameters which are optimized during the training process.

Each embedded sentence is passed to the LSTM encoder, which updates the model through the state transition process. In particular, the state transition between embedded vectors also constitutes a state transition for each sentence within the given textual description of the entity. In fact, these state transitions carry the information exchanged between a sentence and all previous sentences within the entity’s description. Finally, we take the last output hidden state of the LSTM encoder as the final representation of the entity’s description in the given KG. The size of the embedded document vector equals the number of gated LSTM cells.
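A minimal sketch of the document-level encoder (Eq. 2) under the same assumptions as above: one LSTM step per sentence vector, with the last hidden state taken as the description embedding \(\vec{d}\); the dimensions are illustrative only:

```python
import torch
import torch.nn as nn

class DescriptionEncoder(nn.Module):
    """Sketch of the document-level LSTM encoder (Eq. 2): it consumes one
    sentence vector per time step and returns the last hidden state as the
    entity description embedding."""

    def __init__(self, sent_dim=128, doc_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(sent_dim, doc_dim, batch_first=True)

    def forward(self, sentence_vecs):
        # sentence_vecs: [batch, n_sentences, sent_dim]
        _, (h_last, _) = self.lstm(sentence_vecs)
        return h_last.squeeze(0)   # [batch, doc_dim] -> description vector

# Usage: stack the per-sentence vectors from BERT-Text2Vec and encode them.
doc_encoder = DescriptionEncoder()
sentences = torch.randn(1, 5, 128)          # e.g., a 5-sentence description
description_vec = doc_encoder(sentences)    # shape [1, 128]
```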

In fact, textual data are frequently considered complex structured data for which traditional sequential representation learning methods such as GRU/Bi-GRU seem unable to fully capture the latent features of a textual document. The use of GRU/Bi-GRU can help to reduce the computational effort and training time thanks to the smaller number of model parameters. However, our studies in this paper are focused mainly on improving the accuracy of KG embedding for the similarity search task; therefore, we use the LSTM as the main sequential textual encoder in our approach. In Sect. 4.3.2, we also present experiments comparing the accuracy of Bi-GRU and Bi-LSTM.

3.3 W-KG2Vec: text-enhanced meta-path-based KG embedding approach

3.3.1 Text-based similarity weight between entities

From the representations of the textual descriptions of entities learnt in the previous steps, we apply cosine similarity to compute the text-based similarity between entities in a given KG. For any source entity (\(v_{s}\)) and target entity (\(v_{t}\)), the text-based similarity weight, denoted as \({w}_{{v}_{s}\rightsquigarrow {v}_{t}}\), is calculated by the following equation (see Eq. 3):

$$w_{{v_{s} \rightsquigarrow v_{t} }} = \frac{{\overrightarrow {{{\rm d}v_{s} }} \cdot \overrightarrow {{{\rm d}v_{t} }} }}{{\left\| {\overrightarrow {{{\rm d}v_{s} }} } \right\|\,\left\| {\overrightarrow {{{\rm d}v_{t} }} } \right\|}}$$
(3)

where

  • \({w}_{{v}_{s}\rightsquigarrow {v}_{t}}\) is the text-based similarity weight between the source entity (\(v_{s}\)) and the target entity (\(v_{t}\)).

  • \(\overrightarrow {{{\text{d}}v_{s} }}\) and \(\overrightarrow {{{\text{d}}v_{t} }}\) are the textual description representations of the source entity (\(v_{s}\)) and the target entity (\(v_{t}\)) in the given KG, respectively.

3.3.2 Meta-path-based random walk on KG

After learning the representations of the textual descriptions in the given KGs, we apply the meta-path-based random walk to generate contextual entities for each given entity. Given a KG, denoted as \(G = (V, E, \phi, \psi)\), and a defined meta-path (\({\mathcal{P}}\)), for any starting entity in the KG, denoted as \(v_{s}\), and a next target node \(v_{t}\) of the same type, \(\phi \left( {v_{s} } \right) = \phi \left( {v_{t} } \right)\), the transitional probability between the source (\(v_{s}\)) and target (\(v_{t}\)) entities following the given meta-path (\({\mathcal{P}}\)) is denoted as \({\pi}_{{v}_{s}\rightsquigarrow {v}_{t},\mathcal{P}}\). This transitional probability is formulated by the following equation (see Eq. 4):

$$\pi _{{v_{s} \rightsquigarrow v_{t} ,{\mathcal{P}}}} = \left\{ {\begin{array}{*{20}l} {\frac{{\mathop \sum \nolimits_{{{\mathcal{P}}_{{v_{s} \rightsquigarrow v_{t} }} }} \mathop \sum \nolimits_{{i,i \in E_{{\mathcal{P}}} \left( {s\rightsquigarrow t} \right)}} \frac{1}{{\left| {N\left( {v_{i} } \right)} \right|}} + w_{{v_{s} \rightsquigarrow v_{t} }} }}{\lambda },} \hfill & {{\text{if }}e\left( {v_{s} ,v_{t} } \right) \notin E{\text{ and }}\phi \left( {v_{s} } \right) = \phi \left( {v_{t} } \right)} \hfill & {\left( {4{\text{a}}} \right)} \hfill \\ {0,} \hfill & {{\text{if }}e\left( {v_{s} ,v_{t} } \right) \notin E{\text{ and }}\phi \left( {v_{s} } \right) \ne \phi \left( {v_{t} } \right)} \hfill & {\left( {4{\text{b}}} \right)} \hfill \\ {\frac{1}{{\left| {N\left( {v_{s} } \right)} \right|}},} \hfill & {{\text{if }}e\left( {v_{s} ,v_{t} } \right) \in E_{{\mathcal{P}}} } \hfill & {\left( {4{\text{c}}} \right)} \hfill \\ \end{array} } \right.$$
(4)

where

  • \(\sum _{\mathrm{i},\mathrm{i}\in {\mathrm{E}}_{\mathcal{P}}\left(\mathrm{s}\rightsquigarrow \mathrm{t}\right)}\frac{1}{|\mathrm{N}\left({\mathrm{v}}_{\mathrm{i}}\right)|}\) is the sum of transitional probabilities for the meta-path-based walker to travel through all intermediate entities \(v_{i}\) (within the given meta-path \({\mathcal{P}}\)) between entity (\(s\)) and entity (\(t\)), where \(N\left( {v_{i} } \right)\) denotes the out-degree neighbors of \(v_{i}\).

  • \({w}_{{v}_{\mathrm{s}}\rightsquigarrow {v}_{t}}\) is the text-based similarity weight between the source entity (\(v_{s}\)) and the target entity (\(v_{t}\)), calculated by Eq. 3.

  • \(\lambda\) is a global normalizing constant which helps to normalize the transitional probability (\(\pi\)) into the range [0, 1]. Normally, the value of the normalizing constant is calculated for each meta-path-based walk by taking the total transitional probability of all walks via the different path instances of the meta-path.

Semantic-aware meta-path-based random walk (RW) over KG Our random walk mechanism, adapted from our previous works [14, 15], is mainly designed to generate the contextual entities of each KG entity, which are then used in the network representation learning process. Our meta-path-based RW mechanism contains two main types of walks: same-typed walks and different-typed walks. Let us take the meta-path P(place)-C(city)-Ci(country)-C-P in the YAGO knowledge graph as an example (as illustrated in Fig. 6). For walks between [place]-[city] and [city]-[country] entities, we randomly select the next neighborhood node with probability \(\frac{1}{{\left| {N\left( {v_{s} } \right)} \right|}}\) (as shown in Eq. 4c). At the end of each path instance following the given meta-path \({\mathcal{P}}\), we re-calculate the transitional probabilities between the two target-typed entities (in this case, place) to identify the next target-typed entity for the next move of the walk, by selecting the target-typed node with the maximum transitional probability (as given by Eq. 4a) (Figs. 7, 8, 9, 10). The overall semantic-aware meta-path-based RW is controlled by a predefined walk length (\(l\)) and number of walks per node (\(w\)), as applied in previous studies [22, 29].
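The sketch below gives a simplified, hypothetical reading of this text-weighted meta-path walk (Eq. 4): intermediate hops are chosen uniformly (Eq. 4c), while the step that closes a path instance uses only the text-based weight from Eq. 3, omitting the accumulated \(1/|N(v_i)|\) terms and the normalizing constant for brevity; the graph access pattern and the MetaPath helper from Sect. 3.1 are assumptions of the example, not the authors’ implementation:

```python
import random

def metapath_walk(graph, start, metapath, num_hops, text_weight):
    """Simplified text-weighted, meta-path-guided random walk (cf. Eq. 4).

    Assumptions: `graph[node]` yields (neighbor, relation) pairs, `metapath`
    follows the MetaPath sketch above, and `text_weight(u, v)` returns the
    BERT-Text2Vec similarity w_{u~>v} from Eq. 3."""
    steps = metapath.steps()
    target_type = metapath.entity_types[-1]   # symmetric meta-path: same as the start type
    walk, current = [start], start
    for hop in range(num_hops):
        _, rel, next_type = steps[hop % len(steps)]
        candidates = [v for v, r in graph[current] if r == rel]
        if not candidates:
            break
        if next_type == target_type:
            # Closing a path instance: move to the same-typed candidate with the
            # highest text-based weight (Eq. 4a, text term only in this sketch).
            src = current
            current = max(candidates, key=lambda v: text_weight(src, v))
        else:
            # Intermediate hop: uniform transition with probability 1/|N(v)| (Eq. 4c).
            current = random.choice(candidates)
        walk.append(current)
    return walk
```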

Fig. 6 Illustration of meta-path-based random walk mechanism in W-KG2Vec model for generating contextual entities for each entity in a given KG

Fig. 7 Similar entity search task with different KG embedding techniques in YAGO-small dataset

Fig. 8 Similar entity search task with different KG embedding techniques in Freebase-small dataset

Fig. 9 Similar entity search task with different KG embedding techniques in YAGO-large dataset

Fig. 10 Similar entity search task with different KG embedding techniques in Freebase-large dataset

Advantages of applying meta-path-based random walks for KG embedding In the traditional approach to KG embedding, all relations between entities (in the form of triples) are scanned and taken into consideration during the embedding process. This is a time-consuming task and requires considerable computing resources for large-scale KG representation learning. On the other hand, random walks are a computationally efficient approach for large-scale KGs in terms of both computing resources and time. The complexity of storing the immediate same-typed neighbors of each entity is about \(O\left( {\left| E \right|} \right)\). For meta-path-based random walks, it is useful to store the meta-path-based interconnections between the next same-typed neighbors of every entity, with complexity \(O\left( {\alpha ^{2} \left| E \right|} \right)\), where \({{\alpha }}\) is the average out-degree of the entities in the given KG. For each entity, with (\(k\)) contextual samples needed per entity, we can choose a longer walk length (\(l\)), with \(l > k\), which requires an effective computing complexity of only about \(O\left( {\frac{l}{{k\left( {l - k} \right)}}} \right)\) per contextual sample. After the meta-path-based random walk process in a given KG, we obtain a set of contextual entities for each KG entity in the form \(\left\{ {v,c_{t} } \right\}\), where \(c_{t}\) denotes a set of same-typed (t) context entities of a given entity (v). Then, similar to previous heterogeneous network representation learning approaches such as Node2Vec [22] and Metapath2Vec [29], we applied the Skip-gram architecture of the well-known Word2Vec [28] model to generate the training set for our proposed W-KG2Vec model. Specifically, considering a KG as a heterogeneous network with different-typed entities, our proposed KG embedding model is designed to learn the embeddings of different-typed entities over the multi-typed entities generated via meta-paths; therefore, we adopted the heterogeneous Skip-gram approach of Dong et al. in Metapath2Vec [29] to accommodate the heterogeneity of the KG representation learning process.
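As a small illustration of how the generated walks can be turned into heterogeneous Skip-gram training data (grouping context entities by type, as in Eq. 5), assuming a Word2Vec-style window parameter and a type lookup function of our own:

```python
from collections import defaultdict

def skipgram_pairs(walks, entity_type, window=5):
    """Turn meta-path walks into Skip-gram training contexts, grouped by the
    context entity's type as required by the heterogeneous Skip-gram (Eq. 5).
    `entity_type(v)` maps an entity to its type name (our assumption)."""
    contexts = defaultdict(list)   # (target entity, context type) -> [context entities]
    for walk in walks:
        for i, center in enumerate(walk):
            lo, hi = max(0, i - window), min(len(walk), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    contexts[(center, entity_type(walk[j]))].append(walk[j])
    return contexts
```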

Application of heterogeneous Skip-gram architecture From the sets of contextual entities of each entity in the given KGs, extracted by the meta-path-based random walk mechanism, we apply the heterogeneous Skip-gram sampling technique to learn the representation of each entity. In order to learn the representations of the entities in the given KG, the model aims to maximize the probability of observing a set of same-typed contextual entities, denoted as (c), for a specific given entity (v), as in the following equation (see Eq. 5):

$$\mathop {\arg \max }\limits_{\theta } \mathop \sum \limits_{{v \in V}} \mathop \sum \limits_{{t \in T_{V} }} \mathop \sum \limits_{{c_{t} \in N_{t} \left( v \right)}} {\text{Prob}}\left( {c_{t} |v;\theta } \right)$$
(5)

where

  • \(N\left( v \right)\) and \(N_{t} \left( v \right)\) are the set of neighborhood entities of (\(v\)) and the set of neighborhood entities of (\(v\)) with the t-th type, respectively.

  • \({\text{Prob}}(c_{t} |v;\theta )\) is the conditional probability of having context entities (\({\text{c}}\)) belonging to the \({\text{t}}\)-th type given the entity (\({\text{v}}\)).

The probability of observing a set of contextual entities (\({c}_{t}\)) together with a target entity (\(v\)) is normally defined as a softmax function: \(\mathrm{Prob}\left({c}_{t}|v;\theta \right)=\frac{{e}^{{X}_{{c}_{t}} \cdot {X}_{v}}}{\sum _{u\in V,\,\phi \left(u\right)=\phi ({c}_{t})}{e}^{{X}_{u} \cdot {X}_{v}}}\), where \({X}_{{c}_{t}}\) and \({X}_{v}\) are the row embedding vectors of the contextual entities (\({c}_{t}\)) and the given entity (\(v\)), respectively. Then, the sampling distribution of contextual entities over each given entity is formulated by the following objective function (as shown in Eq. 6):

$${\mathcal{O}}_{{c_{t} ,v}} = \log \sigma \left( {X_{{c_{t} }} \cdot X_{v} } \right) + \mathop \sum \limits_{{k = 1}}^{K} \log \sigma \left( { - X_{{u_{t}^{k} }} \cdot X_{v} } \right)$$
(6)

where

  • \({X}_{{c}_{t}}\) and \({X}_{{u}_{t}^{k}}\) stand for the embedding matrix rows of the contextual entities (\({c}_{t}\)) and of the negative sample entities (\({u}_{t}^{k}\)), respectively.

  • \({u}_{t}^{k}\) is the \(k\)-th negative node sampled for the context entities (\({c}_{t}\)); in the heterogeneous sampling approach, the sampled entities (\({u}_{t}^{k}\)) and (\({c}_{t}\)) are of the same type, i.e., \(\phi \left({u}_{t}^{k}\right)=\phi ({c}_{t})\).

Finally, the overall model parameters are estimated by applying stochastic gradient descent (SGD), with the gradients updated as follows: \(X_{v} = X_{v} - \eta \frac{{\partial {\mathcal{O}}_{{c_{t} ,v}} }}{{\partial X_{v} }};X_{{u_{t}^{k} }} = X_{{u_{t}^{k} }} - \eta \frac{{\partial {\mathcal{O}}_{{c_{t} ,v}} }}{{\partial X_{{u_{t}^{k} }} }}\), where \(\eta\) is the learning rate. In more detail, at the beginning, our model iterates through all entities in a given KG to generate the corresponding contextual entities (\({c}_{t}\)) for each target entity via our proposed semantic-aware meta-path-based random walk. Next, we apply the Skip-gram and negative sampling technique to optimize the probability of observing same-typed contextual entities (\({c}_{t}\)) for each target entity (\(v\)) (as shown in Eq. 5). Then, we apply the defined learning objective function (Eq. 6) to obtain the representations of the entities in the given KG with SGD.
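The following NumPy sketch shows one negative-sampling update for Eq. 6, written as gradient ascent on the objective (equivalent to the update form above up to the sign convention of the loss); the embedding-matrix layout, the sampling of same-typed negatives and the learning rate are illustrative assumptions rather than the authors’ code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(X, v, c_t, negatives, lr=0.025):
    """One negative-sampling update for a (target v, context c_t) pair (Eq. 6).

    X is the embedding matrix (rows indexed by entity id); `negatives` are
    entity ids assumed to be sampled from the same type as c_t."""
    # Positive term: increase the score of the observed context entity.
    g = 1.0 - sigmoid(np.dot(X[c_t], X[v]))
    grad_v = g * X[c_t]
    X[c_t] += lr * g * X[v]
    # Negative terms: decrease the scores of sampled same-typed negatives.
    for u in negatives:
        g_neg = -sigmoid(np.dot(X[u], X[v]))
        grad_v += g_neg * X[u]
        X[u] += lr * g_neg * X[v]
    X[v] += lr * grad_v
    return X
```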

3.3.3 Challenges of optimal meta-path’s length and ambiguity in KG embedding

Normally, in the KG-as-HIN embedding task, we might encounter challenges related to the long-length representation of semantic relations between entities in users’ queries. The complexity of users’ queries, together with the existing KG relations, might lead to the common problems of unbounded meta-path length selection as well as ambiguity in the semantic representation of relations between entities. Considering a KG as a heterogeneous network with different-typed relations between entities, such as relations between common entities like persons (“Emmanuel_Macron,” “Donald_Trump”) and locations (“France,” “USA”), there is no clue as to which relation is more important than the others when forming an appropriate meta-path. In fact, most recent HINs are rich in schema, with hundreds of relation types, which makes it difficult to select the proper relations for the meta-paths used in the embedding process. Moreover, similar relations between entities also lead to ambiguity: differently formed meta-paths carry different meanings and only some of them are suitable for answering a specific user query. If the formed meta-paths are too long, the time consumption of the overall embedding process increases. Currently, to prevent these problems, we combine a previous approach by Changping M. et al. [22] for automatically discovering potential meta-paths between specific entity types in a KG with human expert knowledge for selecting proper meta-paths; this semi-supervised technique yields the potential meta-paths that fulfill users’ queries. In the practical implementation of the W-KG2Vec model, the automatically discovered meta-paths between all entity types are shown to users so that they can select the semantic relations suitable for their queries.

4 Experiments and discussions

In this section, we conduct thorough experiments to demonstrate the effectiveness of our proposed W-KG2Vec model. Two well-known benchmark datasets, Freebase and YAGO, are used in our experiments. We compare W-KG2Vec with recent state-of-the-art KG embedding models on the similarity search task in KGs. The extensive comparative studies of the proposed W-KG2Vec model against well-known KG embedding baselines show the effectiveness and scalability of W-KG2Vec in solving the content-rich KG embedding task.

4.1 Dataset usage

To evaluate the performance of the W-KG2Vec model against different KG embedding baselines, we use two main standard datasets: YAGO-{small, large} and Freebase-{small, large}. From these two KGs, we collect the main entity types used for the similar locations/places searching task, as follows (see Tables 2 and 3):

Table 2 Selected entity and relation types for similar locations/places searching task in YAGO and Freebase knowledge graph
Table 3 Number of extracted entities and relations which are used in experiments

For the experiments, we used two main datasets, and each dataset has two versions, small and large. The main purpose of using different sizes of each KG is to evaluate the influence of the KG’s size on the accuracy of each KG embedding model. As shown in Table 3, the number of extracted entities and relations for Freebase is considerably smaller than for YAGO. The larger YAGO dataset is mainly used for extensive evaluation of the scalability of our proposed W-KG2Vec in comparison with other state-of-the-art KG embedding models.

4.2 Experimental setups

The textual description of each entity was collected from multiple Internet resources, mainly Wikipedia and DBpedia (the contents of the “dbo:abstract” and “dbo:comments” fields). For the W-KG2Vec model, we apply the BERT-Text2Vec module to learn the representation of the textual descriptions; these representations are then used to compute the text-based similarity weights \({w}_{{v}_{s}\rightsquigarrow {v}_{t}}\) in the next processing steps. The vector sizes of the sentence and full-text description representations (i.e., the numbers of LSTM cells) are both set to 128 for all experiments.

For the locations/places in each knowledge graph, we manually labeled the level of similarity according to 12 tourist purpose aspects: “amusement park,” “beach,” “historical,” “lake,” “market,” “mountain,” “museum,” “national parks,” “pagoda,” “street,” “temple” and “villages.” One tourist location/place might match multiple aspects; for example, “Stonehenge” (Great Britain) can be labelled as {“historical/prehistoric,” “mountain”}, “Hội_An” (Vietnam) as {“village,” “historical,” “temple”}, and “Disneyland” (USA) as {“amusement park,” “museum”}, etc. Depending on the labelled set of matching tourist aspects of each place, we score the similarity level of each pair of locations/places as follows (see Table 4):

Table 4 Scores for similarity level of two entities

The range of similarity scores shown in Table 4 is adopted from previous studies on the network node similarity search task [12, 16, 22]. On the two given KGs (YAGO and Freebase), we conducted experiments on searching for similar location-/place-typed entities with different KG embedding models. The embedding vectors representing the entities in the two KGs are used to calculate the similarity scores (via cosine similarity) between the location-/place-typed entities in the queries and other same-typed entities in the KGs. To evaluate the results of the similar entity searching task in the given KGs, we use the nDCG (normalized Discounted Cumulative Gain) metric [19]. The average nDCG@10 (top-10 returned entities), nDCG@15, nDCG@20 and nDCG@30 over 100 queries for 100 random entities in each KG are taken as the final results for comparison. The sliding [k] range from 10 to 30 is mainly inherited from previous network node similarity search studies and follows the searching behavior of users in common search engines such as Google, where the results returned in the first three pages (10 results per page) receive most attention. For the W-KG2Vec model, we implemented an experimental environment with the following model configurations:

For the other KG embedding baselines, we applied the golden configurations of each model from their original published works, such as STransE [30], PTransE [34] and RPE [6] (as shown in Table 5). To learn the representations of the given KGs, we apply multiple meta-paths (as shown in Table 6) to capture the semantic meanings of the interconnected relations of similar location-/place-typed entities. We used the default configurations of the Word2Vec [20] and Node2Vec [10] models for our neural network-based training process, with a learning rate of 0.025 and about 300 training epochs for all datasets. For comparative studies with recent KG embedding techniques, we also implemented direct triple/relation-based KG embedding techniques (including TransE [1], TransH, TransR and STransE [21]), path-based KG embedding techniques (including PTransE and RPE), the joint textual deep learning-based technique J-CNN [12], and the BERT-based KG embedding models KG-BERT [40] and K-BERT [19] for solving the same KG embedding tasks on the same datasets (YAGO and Freebase). For the direct relation-based KG embedding techniques (the Trans-family models and STransE), we use the direct triples in the two given datasets for training the entity representations. For the path-based KG embedding models (PTransE and RPE), we use the same meta-paths used in the W-KG2Vec model as the main training paths, \(\mathrm{p}=({\mathrm{r}}_{1},{\mathrm{r}}_{2}\dots {\mathrm{r}}_{\mathrm{l}})\), between entities.

Table 5 Configurations for other KG embedding baselines
Table 6 Used meta-paths for KGs representation learning via W-KG2Vec model

4.3 Experimental results & discussions

4.3.1 Similar entity searching on KG

For the task of searching for similar location-/place-typed entities in both YAGO and Freebase, we randomly picked 100 location-/place-typed entities and conducted similarity searches. The returned entities for each query are sorted into top-10, top-15, top-20 and top-30 lists according to the similarity weights between entities, calculated by the cosine similarity of the embedded vectors of the given pair of entities. Then, these top-k returned entities were evaluated and ranked by their level of relevance (see Table 4) before applying the nDCG metric to calculate the accuracy score of the query. Finally, the average accuracy scores over the 100 queries were taken as the final result for each embedding technique. Tables 7 and 8 show the average top-k nDCG accuracy results for the similar location-/place-typed entity searching task with different embedding models in the YAGO and Freebase KGs, respectively.
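For reference, a minimal sketch of how nDCG@k can be computed from the graded relevance levels of Table 4, using one common DCG formulation; the example relevance values are invented purely for illustration:

```python
import numpy as np

def dcg(relevances):
    """Discounted cumulative gain for a ranked list of graded relevances."""
    rel = np.asarray(relevances, dtype=float)
    return float(np.sum(rel / np.log2(np.arange(2, rel.size + 2))))

def ndcg_at_k(ranked_relevances, k):
    """nDCG@k: DCG of the top-k returned entities normalized by the ideal DCG.

    `ranked_relevances` are the graded similarity levels (Table 4) of the
    returned entities, in the order produced by the embedding model."""
    ideal_dcg = dcg(sorted(ranked_relevances, reverse=True)[:k])
    return dcg(ranked_relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: graded relevances of the top-10 places returned for one query.
print(ndcg_at_k([3, 2, 3, 0, 1, 2, 0, 0, 1, 2], k=10))
```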

Table 7 Average nDCG@k accuracy for different embedding techniques on YAGO dataset
Table 8 Average nDCG@k accuracy for different embedding techniques on Freebase dataset

For the small versions of the two datasets, the experimental results show that our proposed W-KG2Vec model outperforms the other state-of-the-art KG embedding techniques (70.47% on average in YAGO and 79.49% in Freebase). In general, path-based methods perform about 10.18% better in terms of the nDCG metric than direct triple-based KG embedding techniques. The results indicate the advantage of applying paths to better capture the semantics of relations between entities during KG embedding. In fact, the W-KG2Vec model achieves better accuracy on Freebase, which is a smaller KG (< 1 M entities) than YAGO (> 3.8 M entities). On the YAGO dataset, W-KG2Vec outperforms the direct triple-based KG embedding methods by about 16.08% (TransE: 18.09%, TransH: 18.11%, TransR: 15.29% and STransE: 12.83%), the path-based KG embedding methods by 3.65% on average (PTransE: 5.21% and RPE: 2.09%), and J-CNN by 3.92%. On the Freebase dataset, our proposed W-KG2Vec model also improves the nDCG accuracy over both direct triple-based (about 9.09%) and path-based (2.13%) embedding techniques, and over J-CNN by 2.4%.

On the large versions of the two datasets, where the number of entities is more than twice that of the small versions, the experimental outputs demonstrate a more significant improvement of our proposed W-KG2Vec model over previous KG embedding models (as shown in Tables 7 and 8). In more detail, the W-KG2Vec model outperforms the Trans-family models (TransE, TransH and TransR) by about 21.76% on average, STransE by 15%, PTransE by 12.1% and RPE by 10.56%. These results on the large versions of the two datasets demonstrate that our proposed model effectively captures richer relation semantics between entities in the context of large-scale KGs.

Furthermore, compared with recent BERT-based KG embedding approaches, our proposed W-KG2Vec also slightly outperforms K-BERT and KG-BERT, on average by 1.31% and 3.98%, respectively, on YAGO, and by 3.25% and 4.97% on Freebase.

4.3.2 Experimental studies on representation learning approaches for the KG similarity search task

The combination of textual and meta-path-based representation learning for KG embedding In this section, we present experiments comparing the use of sequential textual representation learning alone with the combined text-enhanced meta-path-based approach of the W-KG2Vec model. Our proposed W-KG2Vec model combines two modules: BERT-Text2Vec and the meta-path-based network embedding (MP2Vec) approach, which is largely inspired by the earlier Metapath2Vec model [30]. Figures 11 and 12 show the performance of each embedding module evaluated separately, compared with the complete W-KG2Vec model, for the KG entity similarity search task on the YAGO-large and Freebase-large datasets.

Fig. 11

Comparative studies between W-KG2Vec model with separated embedding modules (BERT-Text2Vec and MP2Vec) in YAGO-large dataset

Fig. 12

Comparative studies between W-KG2Vec model with separated embedding modules (BERT-Text2Vec and MP2Vec) in Freebase-large dataset

Fig. 13

Comparative studies between different types of sequential textual embedding techniques for W-KG2Vec model in YAGO-large dataset

Fig. 14

Comparative studies between different types of sequential textual embedding techniques for W-KG2Vec model in Freebase-large dataset

As shown by the experiments, combining the two embedding modules (BERT-Text2Vec and MP2Vec) of our proposed W-KG2Vec model achieves better performance than using either embedding approach alone in the KG entity similarity search task.

Comparison of GRU/Bi-GRU and LSTM/Bi-LSTM for sequential textual encoding In Sect. 3.2, we presented the use of an LSTM/Bi-LSTM sequential textual encoder to learn representations of the words/sentences in the textual document associated with each KG entity. However, it remains to be determined which type of sequential textual encoder is most suitable for our proposed model to reach the highest accuracy in the similarity search task. We therefore implemented W-KG2Vec with two types of sequential textual encoders, denoted W-KG2Vec-LSTM and W-KG2Vec-GRU (a minimal sketch of the two variants is given below).
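The sketch below illustrates, under our assumptions, how the two variants differ only in the recurrent cell applied on top of the per-token (e.g., BERT) vectors; the class name, dimensions and PyTorch implementation are illustrative rather than the exact code of our model.

```python
import torch
import torch.nn as nn

class SequentialTextEncoder(nn.Module):
    """Encode a sequence of token vectors (e.g., BERT outputs) with a Bi-LSTM or Bi-GRU."""
    def __init__(self, input_dim=768, hidden_dim=128, encoder_type="lstm"):
        super().__init__()
        rnn_cls = nn.LSTM if encoder_type == "lstm" else nn.GRU
        self.rnn = rnn_cls(input_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, token_vectors):
        # token_vectors: (batch, seq_len, input_dim) -> (batch, seq_len, 2 * hidden_dim)
        outputs, _ = self.rnn(token_vectors)
        return outputs

# The two compared variants differ only in the recurrent cell:
lstm_encoder = SequentialTextEncoder(encoder_type="lstm")  # W-KG2Vec-LSTM
gru_encoder = SequentialTextEncoder(encoder_type="gru")    # W-KG2Vec-GRU
```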

Experimental outputs (as shown in Figs. 13 and 14) demonstrate that the LSTM encoder achieves approximately 18.45% better performance than GRU in terms of the nDCG@k metric on both the YAGO and Freebase datasets, suggesting that LSTM is the most suitable sequential textual encoder for our proposed W-KG2Vec model.

Comparison of different vector combination strategies In this section, we study the influence of different vector pooling strategies, namely max pooling, min pooling and average pooling, on the overall accuracy of the W-KG2Vec model. Three versions of the W-KG2Vec model were implemented, corresponding to the different pooling strategies (W-KG2Vec-AvgPool, W-KG2Vec-MaxPool and W-KG2Vec-MinPool), to demonstrate the differences in accuracy in the similar entity search task on both YAGO and Freebase (see the sketch below). Figures 15 and 16 show that using the average pooling strategy in the sequential textual representation learning process helps our proposed W-KG2Vec model achieve the highest accuracy on both the YAGO and Freebase datasets.
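A minimal illustration of the three pooling strategies over the encoder's per-token output vectors is given below; the function name and tensor shapes are assumptions for the sake of the example.

```python
import torch

def pool_token_vectors(token_vectors, strategy="avg"):
    """Combine per-token vectors of shape (seq_len, dim) into one document vector."""
    if strategy == "avg":
        return token_vectors.mean(dim=0)         # W-KG2Vec-AvgPool
    if strategy == "max":
        return token_vectors.max(dim=0).values   # W-KG2Vec-MaxPool
    if strategy == "min":
        return token_vectors.min(dim=0).values   # W-KG2Vec-MinPool
    raise ValueError(f"unknown pooling strategy: {strategy}")

# Hypothetical usage on the Bi-LSTM outputs of one entity description:
encoded = torch.randn(42, 256)                   # (seq_len, 2 * hidden_dim)
doc_vector = pool_token_vectors(encoded, strategy="avg")
```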

Fig. 15

Comparative studies between different types of vector combination strategies for W-KG2Vec model in YAGO-large dataset

Fig. 16

Comparative studies between different types of vector combination strategies for W-KG2Vec model in Freebase-large dataset

4.3.3 Text-based similarity weight studies via different textual embedding techniques

The W-KG2Vec model mainly applies the BERT-Text2Vec textual representation, which combines a pre-trained BERT model and an LSTM encoder. To demonstrate the superiority of this combination over common textual representation learning techniques, such as topic modeling with Latent Dirichlet Allocation (LDA), Word2Vec and Doc2Vec, we implemented W-KG2Vec with different textual representation learning mechanisms to solve the same similar entity search task. The details of the different textual representation learning implementations for W-KG2Vec are given in Table 9; a minimal sketch of how such baseline document vectors can be built follows the table.

Table 9 Details of different textual representation learning implementations for W-KG2Vec
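As an illustration only, the following gensim-based sketch shows one plausible way to build the baseline document vectors (Word2Vec averaging, Doc2Vec, and LDA topic distributions) for the textual descriptions of entities; it does not reproduce the exact configurations listed in Table 9, and the toy documents are hypothetical.

```python
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import Doc2Vec, LdaModel, Word2Vec
from gensim.models.doc2vec import TaggedDocument

# Hypothetical tokenized descriptions of two entities.
docs = [["louvre", "art", "museum", "paris"],
        ["orsay", "museum", "paris", "impressionism"]]

# Word2Vec variant: average the word vectors of a description.
w2v = Word2Vec(sentences=docs, vector_size=100, window=5, min_count=1, epochs=50)
w2v_doc = np.mean([w2v.wv[w] for w in docs[0]], axis=0)

# Doc2Vec variant: one learned vector per description.
d2v = Doc2Vec([TaggedDocument(d, [i]) for i, d in enumerate(docs)],
              vector_size=100, min_count=1, epochs=50)
d2v_doc = d2v.dv[0]

# LDA variant: the topic distribution of a description serves as its representation.
dictionary = Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(corpus=bow, id2word=dictionary, num_topics=8)
lda_doc = lda.get_document_topics(bow[0], minimum_probability=0.0)
```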

To evaluate the performance of each textual representation learning technique, we varied the size of the two given KGs (YAGO and Freebase) from 10 to 100%. Each modified implementation of W-KG2Vec (described in Table 9) was applied to the similar entity search task, and the accuracy was reported in terms of nDCG@30. Tables 10 and 11 present the nDCG@30 accuracy of each textual representation learning implementation of W-KG2Vec on the YAGO-small and Freebase-small datasets.

Table 10 Evaluations of different textual representation learning implementations for W-KG2Vec on YAGO
Table 11 Evaluations of different textual representation learning implementations for W-KG2Vec on Freebase

The experimental results (Figs. 17 and 18) demonstrate the effectiveness of our proposed BERT-Text2Vec for textual representation learning compared with well-known embedding techniques (LDA, Word2Vec and Doc2Vec). These earlier techniques do not fully account for the sequential relations between words in different document contexts, which leads to a significant decrease in the quality of the resulting textual representations. Overall, the original W-KG2Vec implementation with BERT-Text2Vec-based textual representation learning outperforms the other textual embedding techniques by about 5.63% (YAGO) and 5.78% (Freebase).

Fig. 17

Comparisons of different textual representation learning implementations for W-KG2Vec on YAGO

Fig. 18

Comparisons of different textual representation learning implementations for W-KG2Vec on Freebase

4.3.4 Parameter sensitivity studies

Model parameters of the network representation learning process In this section, we present experiments on the influence of the model parameters, including the walk length (\(l\)), the number of walks per node (\(w\)) and the embedding vector dimension (\(d\)), on the KG embedding task in the YAGO-small dataset. Following the same experimental procedure for the similar entity search task on YAGO-small, we varied the values of the walk length (\(l\)), the number of walks per node (\(w\)) and the embedding vector dimension (\(d\)) and reported the changes in W-KG2Vec's accuracy in terms of the nDCG@30 metric.

We conducted multiple experiments on the similar entity search task with different values of the model parameters. Figure 19 shows the experimental outputs as a function of each of the three parameters while the other two are fixed. From the results, we observe that the performance of our proposed W-KG2Vec model gradually improves as the number of walks per node (\(w\)) increases, and becomes stable once \(w\) exceeds 800. Similarly, increasing the walk length (\(l\)) also leads to a significant improvement in overall accuracy, which stabilizes above 120. The number of walks per node and the walk length are important for capturing the semantics between entities in KGs: higher values of \(w\) and \(l\) generate more contextual entities for each target entity. In the training set generation process of our proposed W-KG2Vec model, the set of contextual entities generated for each target entity is mainly controlled by these two parameters, which ensure that the set of generated contextual entities is large enough to guarantee the quality of the learnt entity embedding vectors (a simplified sketch of this generation step is shown below).
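The following simplified sketch shows how \(w\) and \(l\) control the amount of generated context per entity. It uses a plain weighted random walk without meta-path constraints, so it is only an approximation of our generation step; all names and the toy graph are illustrative.

```python
import random

def generate_walks(graph, num_walks_per_node, walk_length, weight_fn):
    """Generate weighted random walks: each node starts `num_walks_per_node` walks
    of up to `walk_length` steps; the next hop is sampled proportionally to the
    (e.g., text-based) similarity weight of the candidate neighbors."""
    walks = []
    for _ in range(num_walks_per_node):
        for start in graph:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = graph[walk[-1]]
                if not neighbors:
                    break
                weights = [weight_fn(walk[-1], n) for n in neighbors]
                walk.append(random.choices(neighbors, weights=weights, k=1)[0])
            walks.append(walk)
    return walks

# Hypothetical usage: larger w and l produce more and longer context windows
# per target entity for the subsequent skip-gram training step.
toy_graph = {"Louvre": ["Paris"], "Paris": ["Louvre", "Orsay_Museum"], "Orsay_Museum": ["Paris"]}
walks = generate_walks(toy_graph, num_walks_per_node=10, walk_length=5,
                       weight_fn=lambda u, v: 1.0)
```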

Fig. 19

Parameter sensitivity studies on W-KG2Vec model

For the embedding vector dimension parameter (\(d\)), we can see that once the embedding dimension exceeds 130, the accuracy of our proposed W-KG2Vec model reaches its highest value and becomes stable. The chosen embedding dimension for entities is also important, as it influences the performance of the overall representation learning process. With too high a value of \(d\), our model needs more time and more computing resources to complete the training process. In other words, the configured dimensionality of the embedding vectors is an important factor in network/KG embedding and is frequently chosen heuristically by considering the size and the distinctive features of the entities in the given KG.

Different textual embedding approaches for the W-KG2Vec model In the previous section, we presented studies on the use of different textual embedding models besides BERT-Bi-LSTM, including the LDA, Word2Vec and Doc2Vec models. To thoroughly evaluate the influence of the models' hyperparameters, we varied the embedding vector dimensionality (for BERT-Bi-LSTM, Word2Vec and Doc2Vec) and the number of latent topics (for LDA). As shown by the experimental outputs in Fig. 20, our proposed W-KG2Vec model reaches its highest and stable accuracy with an embedding vector dimensionality (\(d\)) above 110 for the BERT-Bi-LSTM approach and above 80 for both the Word2Vec and Doc2Vec approaches. With the LDA topic modeling approach, the model reaches its highest accuracy when the number of latent topics (\(k\)) is over 8.

Fig. 20

Parameter sensitivity studies on different textual embedding approaches for W-KG2Vec model

4.3.5 System performance and scalability evaluations

In the era of big data with extremely large-scale KGs, it is important to demonstrate the efficiency and scalability of the proposed KG embedding model. To evaluate the efficiency and scalability of our proposed W-KG2Vec model, we conducted experiments with the default configurations (as shown in Table 12) on a single server with an Intel® Xeon® E7-8890 v4 CPU (24 cores) and 64 GB of memory. We ran the experiments with different numbers of threads, from 1 to 24, and reported the speedup rates with respect to the number of threads used (a simple way to compute these rates is sketched below). In this experiment, we used YAGO-{small, large} as the main dataset; YAGO is a fairly large KG with more than 3.8 M entities and 5.1 M relations. Figures 21 and 22 show the average speedup rate of our proposed W-KG2Vec model in the multi-threaded running environment on the two versions of the YAGO dataset. As shown by the experimental outputs, the W-KG2Vec model achieves an acceptable sub-linear speedup rate that is quite close to the optimal line. Overall, this experiment demonstrates that our proposed W-KG2Vec model is efficient and scalable for handling large-scale KGs with millions of entities and relations.
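A simple way to obtain such speedup rates is sketched below; `train_w_kg2vec` is a hypothetical handle to the training routine and is not part of our released code.

```python
import time

def measure_speedup(train_fn, thread_counts):
    """Report the speedup T(1) / T(n) of a training routine for several thread counts."""
    timings = {}
    for n in thread_counts:
        start = time.perf_counter()
        train_fn(num_threads=n)        # e.g., skip-gram training with n worker threads
        timings[n] = time.perf_counter() - start
    baseline = timings[thread_counts[0]]
    return {n: baseline / t for n, t in timings.items()}

# Hypothetical usage, mirroring the reported 1-24 thread experiment:
# speedups = measure_speedup(train_w_kg2vec, thread_counts=[1, 4, 8, 12, 16, 20, 24])
```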

Table 12 Experimental setup parameters for W-KG2Vec model
Fig. 21

Average speed-up rate for our proposed W-KG2Vec model in YAGO-small

Fig. 22

Average speed-up rate for our proposed W-KG2Vec model in YAGO-large

5 Conclusions and future works

In this paper, we presented a novel text-enhanced KG representation learning approach, called W-KG2Vec. In the context of heterogeneous networks with diverse types of entities and relations, KG embedding is a challenging task for complex similarity searching/querying. A common technique for complex entity search in KGs is to model users' queries as meta-path-based patterns. However, most well-known KG embedding techniques are direct relation/triple-based approaches and are ill-suited to complex similar entity search tasks. Recent path-based KG embedding techniques also lack thorough consideration of textual semantics as well as of the diverse types of relations between KG entities, which reduces the quality of the KG representation. To address these challenges, we proposed a joint textual representation learning and weighted meta-path-based random walk mechanism to improve the accuracy of the KG embedding task.

The introduction of BERT-Text2Vec is our first contribution. BERT-Text2Vec combines a pre-trained BERT model and an LSTM encoder to learn bidirectional sequential representations of the textual descriptions of KG entities. These textual representations are then used to compute text-based similarity weights between pairs of entities in the given KGs, and these weights play an important role in our proposed weighted meta-path-based random walk strategy. The weighted meta-path-based random walk mechanism generates contextual entities for each entity in the KGs, which are used to train the representation model with the heterogeneous skip-gram method. Extensive experiments on benchmark datasets demonstrate that the W-KG2Vec model handles complex entity searching/querying tasks better than recent state-of-the-art KG embedding baselines. Our future work includes various improvements mainly related to the model's scalability. We intend to extend our proposed W-KG2Vec model to work with distributed processing platforms such as Apache Spark, enabling it to handle massive KGs with billions of entities and relations.