1 Introduction

Recommendation systems (RSs), which aim to determine user preferences and provide items that the user might be interested in, have undergone rapid growth during the past decade in various fields, such as search engines, video portals, and e-commerce [4, 7, 22]. As a significant recommendation approach, traditional embedding-based methods have achieved impressive results by embedding user-item data into a low-dimensional continuous vector space and harnessing abundant side information. Despite their developments for several popular benchmarks, the recommendation results are calculated between latent vectors and do not take their relations into account, which leads to the poor explainability of embedding-based methods.

To address this issue, knowledge graph (KG), which is a well-structured auxiliary data format that evolved from semantic webs, has naturally been applied in recommendation systems to boost their reasoning ability and explainability [19, 27]. Information in a KG is organized as triples (head, relation, tail), and the head and tail entities are combined with relations. In KG-enhanced recommendation methods, all the users and items are regarded as entities together with other background information. By exploiting multihop relations from a target user entity within a KG, a user preference and its semantics can be explicitly revealed. Such a correlation is recorded as a path. Clearly, each target user entity has copious paths that lead to different items that the user may like, offering precise textual information for biclassification and top-K recommendation tasks. As a result, these path-based methods [10, 21] quickly received considerably more research interest than traditional translation-based methods such as TransE [2], TransH [25] and the collaborative knowledge graph embedding method (CKE) [32].

In terms of evaluating a recommender system augmented by a KG, we consider it crucial that convincing recommendation results be accurate and explainable. However, none of the existing path-based methods can satisfy the above two conditions at the same time. First, they neglect the difference between paths in the same user-item interaction. Taking a simple case in the music recommendation field as an example, the act “(user, likes, song)∧(song, sung_by, singer)∧(singer, sing, song)” obviously contributes more to results than the act “(user, likes, song)∧(song, sung_by, singer)∧(singer, is_brother_of, singer)∧(singer, sing, song)” – the latter is uncommon and less credible in explaining why this song should be recommended to this user. Indiscriminately processing the paths reduces information utilization and negatively affects explainability. Second, existing path-based methods are not as impressive in terms of accuracy and can be even worse than some traditional embedding-based models. Such defects, more seriously, increase cold-start costs in some top-k recommendation tasks. In a knowledge-aware path recurrent network (KPRN) [24], the prior knowledge that is used to complete the KG accounts for 50% of the original dataset, making it difficult to apply the model in industry (Figure 1).

Fig. 1
figure 1

A toy example to illustrate why incorporating “relations” into recommendations can enhance explainability. In this music knowledge graph, there are different relations between entities. Finding all the relations related to “If Tom likes Scream” can bring more insights and make the result more reasonable

To bridge these gaps, a new recommendation model, named the Path-enhanced Recurrent Network (PeRN), is proposed that extracts not only the path information in a KG but also a metapath set in order to differentiate the path contribution to one user-item pair. Inspired by previous work [33], we innovatively use a metapath, a general concept in a heterogeneous information network (HIN), to generalize the structure of complicated paths. The metapath is a path schema consisting of a series of entity-type data and relation data. By using a metapath, all the paths in a dataset can be abstracted as several kinds of user habits, which will be used to calculate credibility values in the entropy encoder in our model. Thus, the confidence ratio between different paths in the same user-item pair can be obtained. For the path set, we adopt a bidirectional long short-term memory (bi-LSTM) network and a two-layer fully connected perceptron to compute sequential entity and relation vectors to determine a score. Afterwards, a weighted pooling layer is constructed to combine these path scores and credibility values from the entropy encoder to obtain a predicted result. To learn the parameters effectively, we propose using a logarithmic loss function along with a ridge penalty, i.e., L2 regularization, to train our model. Furthermore, we design a bidirectional search method, which is also metapath-aided, to enhance the efficiency of path extraction and make PeRN easier to use.

To validate the ability of PeRN in recommendation accuracy and solve the cold-start issue, we conduct extensive experiments on two real-world datasets. Additionally, we give a user-item interaction example that is randomly chosen from the dataset to illustrate the increased explainability of our model. The main technical contributions of this paper include the following:

  • An end-to-end model, PeRN, with an entropy encoder is proposed to enhance the explainability of the recommendations. The metapath, which helps to extract and encode user habits, is innovatively applied in general path-based methods. Our metapath-based path method avoids the limitation that the paths cannot be differentiated in one user-item interaction. Compared with traditional path-based models, the proposed PeRN achieves more expressive explainability and successfully reduces the cold-start costs.

  • A novel path extraction method is proposed that is augmented with a bidirectional strategy. Similar to the Sentinel algorithm, the proposed bidirectional strategy can efficiently extract path data from a KG, which makes it possible to apply PeRN in real-world scenarios. Based on this, we also perform extensive experiments on two real-world datasets from biclassification and top-K perspectives. Our proposed PeRN outperforms several representative baselines in terms of accuracy. These demonstrations also highlight the importance of integrating KGs into recommendations and verify the practicality of our proposed methods.

The rest of this paper is arranged as follows: Section 2 gives a brief review of two kinds of related works. Then, Section 3 describes the framework and notation of the PeRN model, and Section 4 explains PeRN in detail. The experimental results that verify this model are shown in Section 5. Finally, Section 6 contains conclusions and some ideas for future work.

2 Related works

This section provides a general summary of several state-of-the-art methods of integrating KGs into recommendations, which can be largely divided into translation-based and path-based methods.

2.1 Translation-based methods

Prior research has proposed various techniques [1, 2, 5, 13, 15, 25] for embedding a KG into a low-dimensional vector space in a way that makes the KG computable. These methods can be roughly divided into two categories [23]: 1) translational distance models, including translation-based and other distance models, and 2) semantic matching models, including tensor-factorization-based and neural-network-based models. Translation-based models such as TransE [2] and its variants [12, 13, 25], which use the idea of translation to transform entities and relations into vectors to embed KGs, have been widely used in KG recommendation because of their simplicity and effectiveness. CKE [32] first employed Bayesian TransR [13] to generate the user latent vector and KG-aided item latent vector to collaboratively learn the predicted result. DKN [27], leveraging four different translation-based models [2, 12, 13, 25] to embed a KG and enrich the side information, also applied an attention-based deep convolutional neural network to the news recommendation field. More recently, the translation-based user preference model (TUP) [3] employed TransH [25] to predict user preferences from a KG to improve recommender systems.

Such translation-based methods significantly increase the accuracy of the results. However, they fail to explore the correlations (i.e., multi-hop relations) among user-item pairs. In other words, these methods fail to explain why the user has interest in these items. Thus, we argue that these methods lack explainability and reasoning ability in recommendations.

2.2 Path-based methods

Regarding path-based methods, Yu et al. [31] integrated a heterogeneous information network into matrix factorization for personalized entity recommendation. This was a new idea that not only first introduced user-item-path conception to recommender systems but also inspired other researchers to apply paths in KGs for convincing recommendations. As a consequence, various path-based approaches [11, 24, 33] have been developed to memorize sequential implicit information in KGs. Knowledge-based sequential recommendation (KSR) [11] incorporates a gated recurrent unit (GRU) and key-value memory network (KV-MN) to capture sequential user preferences from a knowledge base, while recurrent knowledge graph embedding (RKGE) [19] leverages recurrent network batches to embed path semantics to increase interpretability. In KPRN [24], a long short-term memory (LSTM) network is utilized to encode paths and explore the connectivity between users and items. Additionally, explainable interaction-driven user modeling (EIUM) [10] extends this idea to a multimodal knowledge base and designs a self-attention matrix to mine deep information from a path. With the development of graph neural networks (GNN), there are also some path-based methods adopting GNN as their path encoder. Path conditioned Graph Convolutional Network (PGCN) [30] uses graph convolutional network (GCN) to encode path information between entities, while Price-aware User Preference-modeling (PUP) [34] adopt GCN as a part of their self-defined encoder-decoder to record pairwise interactions for better recommendation.

Although these methods improve the explainability and reasoning ability of recommender systems, there is still a severe defect: the underutilization of mining semantics in paths. This flaw also causes poor performance in dealing with cold-start issues: the prior knowledge needed for completing a KG is extremely costly. Moreover, path extraction is also a time-consuming and labor-intensive step in path-based methods, as the time complexity of the algorithm grows exponentially with the length of the path.

3 Model design

3.1 Framework of PeRN

With a given knowledge graph and a target user-item interaction, the path set and its metapath set can be extracted by a bidirectional path extraction algorithm. After an embedding step, these paths and metapaths are processed to obtain a certain score and a weight by a recurrent network encoder and an entropy encoder, respectively. Combining these scores by using a weighted pooling layer, the output of PeRN, a recommendation prediction, is obtained. The overall framework of PeRN is illustrated in Figure 2.

Fig. 2
figure 2

Framework of PeRN

PeRN contains four key components: a bidirectional path extraction algorithm to extract paths effectively from a KG between two target entities (i.e., a user and item), a recurrent network encoder that is based on a Bi-LSTM neural network and a fully connected layer to encode the path information as a certain value, an entropy encoder that adopts information gain and metapaths to assign a weighting score to each path in one user-item interaction, and a weighted pooling layer that combines the above values and weighting score to obtain a predicted result.

3.2 Problem definition

Before describing our model, we formally define the notation used throughout this paper in Table 1. Similar to other recommender systems, we let \(\mathcal {U} = \{u_{1}, u_{2}, ..., u_{|\mathcal {U}|}\}\) and \(\mathcal {I} = \{i_{1}, i_{2}, ..., i_{|\mathcal {I}|}\}\) denote the user set and item set, respectively. \(\mathcal {A} = \{(u, i)|u \in \mathcal {U}, i \in \mathcal {I}\} = \{a_{1}, a_{2}, ..., a_{|\mathcal {A}|}\}\) represents all the interactions between users and items in our dataset.

Table 1 Notation

Definition 1

Knowledge Graph. As the entity set \(\mathcal {E}\) and relation set \(\mathcal {R}\) have been defined, a knowledge graph (KG) can be defined as \(\mathcal {K}\mathcal {G} = \{(h, r, t)|h, t \in \mathcal {E}, r \in \mathcal {R}\}\), where (h,r,t) is a triple combining a head entity h and tail entity t through a relation r. Here, every user and item in \(\mathcal {U}\) and \(\mathcal {I}\) can be searched as an entity in \(\mathcal {K}\mathcal {G}\), which makes the extraction of paths between user-item interactions possible. By abstracting entities in \(\mathcal {E}\) to entity types (shown as the function typeof() in the table above), the schema of \(\mathcal {K}\mathcal {G}\) can be revealed along with relations, and this is expressed as a two-dimensional matrix \(\mathcal {G}\).

Definition 2

Path and Metapath. Given an interaction ak = (um,in), a sequence of triples that connect user um and item in can be found in \(\mathcal {K}\mathcal {G}\) as {(um,r1,e1),(e1,r2,e2),...,(el− 1,rl,in)}, which we record as a path: \(\small {p = u_{m}\overset {r_{1}}{\rightarrow }e_{1}\overset {r_{2}}{\rightarrow }e_{2} ... \overset {r_{l}}{\rightarrow }i_{n}}\). Therefore, all the qualified paths of an interaction ak and interaction set \(\mathcal {A}\) are denoted as \(P_{k} = \{p_{1}, p_{2}, ..., p_{|P_{k}|}\}\) and \(\mathcal {P} = \{P_{1}, P_{2}, ... P_{|\mathcal {A}|}\}\), respectively. As each path forms a metapath by abstracting its entities to entity types, we denote mp as the metapath of path p and \(MP_{k} = \{mp_{1}, mp_{2}, ..., mp_{|MP_{k}|}\}\) as the metapath set of path set Pk (and of interaction ak). The metapath of all interactions, denoted as a set \({\mathscr{M}}\), can be obtained by a certain traversal of the schema graph \(\mathcal {G}\).

In addition, there are three points to be emphasized: 1) \(\mathcal {U}, \mathcal {I}\subsetneqq \mathcal {E}\) and \(\mathcal {U}\cap \mathcal {I}=\varnothing \). 2) MPk and Pk are both generated from ak, so \(|MP_{k}| \leqslant |P_{k}|\), and equality holds when all path types in |Pk| are different. 3) \({\mathscr{M}} = MP_{1}\cup MP_{2}\cup ...\cup MP_{|\mathcal {A}|}\).

Definition 3

PeRN Task. With the given user-item interaction ak and its path set Pk, the goal of PeRN is formulated as follows:

$$ \hat{y}_{k}=f_{{{\varDelta}}}(P_{k}), $$
(1)

where \(\hat {y}_{k}\) is the predicted score of interaction ak and f denotes the function of PeRN with parameters Δ.

4 Path-enhanced recurrent network

In this section, we thoroughly describe and elaborate our proposed PeRN for incorporating KGs into recommendations. First, we design a bidirectional path extraction algorithm to boost the efficiency of discovering path sets between user and item entities in a KG. Then, we adopt a Bi-LSTM network as a model to embed the path set and remember it as a set of predicted scores. Furthermore, the entropy encoder is created to differentiate the contributions of the paths in a path set by analyzing the information gain of their metapath. Finally, a weighted pooling layer and optimization steps are used to combine the scores and learning.

4.1 Bidirectional path extraction

A KG generally contains millions of entities and relations, which indicates that it is labor-intensive and time-consuming to find all paths between two entities. In addition, the difficulty of searching a path increases exponentially with its length, which makes path extraction even more difficult.

To address this issue, a metapath-aided bidirectional path extraction algorithm is proposed to retrieve all qualified paths, as described in Algorithm 1. As previous work [18] discussed, we regard paths that are longer than six hops as noise. Thus, this bidirectional scheme can change the complexity of the search step from a maximum of six hops to a maximum of three hops. In the preliminary steps, we first abstract the KG to its schema graph by changing the entities in the triples to entity types. By removing all duplicates among these processed triples, the schema graph can be displayed via a matrix in which every dimension has a list of all entity types. The elements in the matrix can store the types of relations between the entity types. Accordingly, the whole metapath set \({\mathscr{M}}\) can be retrieved by traversing this directed graph. Given a knowledge graph \(\mathcal {K}\mathcal {G}\), user-item interaction set \(\mathcal {A}\) and metapath set \({\mathscr{M}}\), Algorithm 1 can help us find a whole path set \(\mathcal {P}\). After the initialization of \(\mathcal {P}\), for each interaction ak, we adopt the depth-first-search (DFS) idea to retrieve candidate paths and the metapath-aided idea to choose target paths bidirectionally. In fact, candidate sets S1 and S2 can be used in parallel for different metapaths so that we significantly enhance the efficiency of path extraction by designing a multithreaded program.

figure a

4.2 Recurrent network encoder

With the increasing development of deep learning models, recurrent neural networks (RNNs) have become increasingly widely used in processing sequence data such as path information [19, 24]. In the PeRN model, as relations in paths are not always in one direction – for example (Michael Jackson, Cooperate_with, Janet Jackson) and (Janet Jackson, Cooperate_with, Michael Jackson) – we choose a bi-LSTM-based model, as illustrated in Figure 3, to better sequence information and output the predicted score of the path. In other words, we input a path p and output its predicted score s from this network.

Fig. 3
figure 3

The architecture of the recurrent network encoder in PeRN

A path here can be recorded as a sequence of entities and relations such as p = [um,r1,e1,r2,e2...,rl,in] as well as a number of relations less than or equal to six. To embed this multihop data into a series of vectors, we first transform this l-hop path to l + 1 consecutive vectors such as (um,typeof(um),r1),(e1,typeof(e1),r2),...(el− 1,typeof(el− 1),rl),(in,typeof(in),null). Then, the entity index, a number from 0 to millions, is mapped to a d − 2-dimensional vector with a similar number. After concatenating the entity vector with the entity type index and relation index, we can obtain a d-dimensional vector representing a hop on the path. The q-th vector in the path can be defined via the following equation:

$$ \alpha_{q} = \text{Map}(e_{q-1})\oplus e^{\prime}_{q-1} \oplus r_{q}, $$
(2)

where ⊕ is the concatenation operation and \(e^{\prime }\) is the type of entity e. After this step, path p is represented as a vector set {α1,α2,...,αl+ 1}, which can be taken as the input of this model. With a strong ability to remember sequential semantics [14], LSTM [8] is chosen to be the main body of the recurrent network encoder, and the q-th vector αq of the input vector set is computed as follows:

$$ \begin{aligned} f_{q} &= \sigma(W_{f}\dot (h_{q-1}\oplus \alpha_{q})+b_{\small f}) \\ i_{q} &= \sigma(W_{i}\dot (h_{q-1}\oplus \alpha_{q})+b_{i}) \\ \widetilde{C}_{q} &= \tanh(W_{C}\dot (h_{q-1}\oplus \alpha_{q})+b_{C}) \\ C_{q} &= C_{q-1} \odot f_{q} + \widetilde{C}_{q} \odot i_{q} \\ o_{q} &= \sigma(W_{o}\dot (h_{q-1}\oplus \alpha_{q})+b_{o})\\ h_{q} &= o_{q} \odot \tanh(C_{q}) \end{aligned} $$
(3)

where fq, iq and \(o_{q} \in \mathbb {R}^{d^{\prime }}\) denote the forget, input and output gates; \(\widetilde {C}_{q}\) and \(C_{q} \in \mathbb {R}^{d^{\prime }}\) denote the candidate value vector and memory state vector; Wf, Wi, WC and \(W_{o} \in \mathbb {R}^{d^{\prime }\times (d^{\prime }+d)}\) are weight matrices initialized with random values; bf, bi, bC and bo are the bias vectors of each gate or cell; σ and \(\tanh \) denote sigmoid and hyperbolic tangent activation functions; and ⊙ and ⊕ represent the Hadamard product and concatenation, respectively. Consequently, the hidden state vector \(h_{q} \in \mathbb {R}^{d^{\prime }}\) of the q-th step is obtained by the last hidden state hq− 1 given αq. After l + 1 iterations, the last hidden state vector hl+ 1 can be considered to hold the sequential path information. To explain more precisely, we simplify this process as the following equation:

$$ h_{l+1} = \text{LSTM}([\alpha_{1}, \alpha_{2}, ..., \alpha_{l+1}]). $$
(4)

As shown in Figure 2, our PeRN model takes bi-LSTM into consideration; it adopts a forward LSTM and a backward LSTM and then concatenates the bidirectional results to remember path information more reliably. This step can be formulated as the equations below:

$$ \begin{aligned} &\overrightarrow{h}_{l+1} = \text{ LSTM}([\alpha_1, \alpha_2, ..., \alpha_{l+1}])\\ &\overleftarrow{h}_{l+1} = \text{ LSTM}([\alpha_{l+1}, ..., \alpha_2, \alpha_1])\\ &\quad\quad\quad h = \overrightarrow{h}_{l+1}\oplus \overleftarrow{h}_{l+1}. \end{aligned} $$
(5)

After embedding path p into a representative vector h, the final step of the recurrent network encoder is to convert it to a predicted score by establishing a simple neural network with two fully connected layers. Therefore, the score s of the path can be calculated by the following equation:

$$ s={W}_{2}^{\mathrm{T}}\text{ReLU}\left( W_{1}^{\mathrm{T}}h\right). $$
(6)

Here, \({W}_{1}^{\mathrm {T}}\) and \({W}_{2}^{\mathrm {T}}\) represent the coefficient weights of layer 1 and layer 2, respectively, and we adopt a rectified linear unit (ReLU) as the activation function in each neuron with omitted bias. In fact, the score s is a 1 × 1 matrix, and we directly treat it as a scalar for simplicity.

4.3 Entropy encoder

Given an interaction ak = (um,in) and its extracted path set \(P_{k} = \{p_{1}, p_{2}, ..., p_{|P_{k}|}\}\), the scores of each path can be stored in a set \(S_{k} = \{s_{1}, s_{2}, ..., s_{|P_{k}|}\}\). By abstracting the entity instances to entity types in a path, the metapath set of ak can be denoted as \(MP_{k}=\{mp_{1}, mp_{2}, ..., mp_{|MP_{k}|}\}\), where \(|MP_{k}|\leqslant |P_{k}|\) (cf. Section 3). Clearly, it does not make sense to predict ak by taking the weighted averages of the path scores (cf. Section 1). To address this issue and increase the explainability, we design an information entropy-based method.

Inspired by recent applications and advances in the computer vision field [17, 20], we design an entropy-based weighting encoder to differentiate path contributions by computing the information gain for specific interactions. In practice, all the interactions are either positive feedback or negative feedback, which indicates that each path p in a path set Pk shares the same target, 1 or 0. Here, we collectively define a path or metapath with target 1 as a confidence path (CP) and otherwise as a non-confidence path (NP). By traversing and counting all paths p in the whole path set \(\mathcal {P}\), it is possible to obtain the frequency of “p is a CP” considering that it is a binary classification problem. Therefore, the information entropy of the event “p is a CP” (denoted as event D, which takes the values 0 and 1) is defined as follows:

$$ \begin{array}{@{}rcl@{}} \text{Ent(D)} = -\underset{j\in{0, 1}}{\sum}\text{P(D}=j)\log_{2}\text{P(D}=j). \end{array} $$
(7)

Although there are millions of paths in \(\mathcal {P}\), the number of metapaths is limited, and we denote it as an integer m. Therefore, we can endow path m with Boolean features to determine the type of its metapath and further judge its contribution to event D. Similar to (5), for feature Eg(g ∈ [1,m]), we can find its conditional entropy for event D as follows:

$$ \text{Ent}(\mathrm{D}|E_{g})=\underset{i\in{0,1}}{\sum}\mathrm{P}(E_{g}{=}i)\text{Ent}(\mathrm{D}|E_{g}{=}i), $$
(8)

where Ent(D|Eg = i) is the conditional entropy; it is obtained by fixing Eg=i and is written as the equation below:

$$ \text{Ent}(\mathrm{D}|E_{g}=i)=-\underset{j\in{0,1}}{\sum}\mathrm{P}(\mathrm{D}=j|E_{g}=i)\log_{2}\mathrm{P}(\mathrm{D}=j|E_{g}=i). $$
(9)

In this way, the information gain of event D for a given feature Eg can be obtained:

$$ \text{Gain}(\mathrm{D}, E_{g})=\text{Ent}(\mathrm{D})-\text{Ent}(\mathrm{D}|E_{g}). $$
(10)

For a given interaction ak and its metapath set MPk, each mpi(i ∈ [1,|MPk|]) can determine its unique corresponding information gain by matching Eg=mpi. Here, we normalize the information gain of each metapath to obtain its weight:

$$ w_{i}=\frac{\text{Gain(D,}E_{g} = mp_{i})}{{{\varSigma}}_{j=1}^{|MP_{k}|}\text{Gain(D,}E_{g}=mp_{j})}, $$
(11)

where wi is the weight of the i-th path pi in the path set of ak; we can further differentiate the score si in Sk. All the weights of the same interaction ak are put in a weight set Wk for a later step.

4.4 Objective of optimization

After obtaining a score set \(S_{k} = \{s_{1}, s_{2}, ..., s_{|P_{k}|}\}\) and its corresponding weight set \(W_{k}=\{w_{1}, w_{2}, ..., w_{|P_{k}|}\}\), we adopt a weighted pooling layer that combines them to obtain the final predictive score, which is formulated as:

$$ \hat{y}_{k}=\sigma\left( \sum\limits_{i=1}^{|P_{k}|}w_{i}s_{i}\right), $$
(12)

where σ is a sigmoid activation function that maps the final score to a range of 0 to 1. The weighted pooling layer, which can be regarded as an attention mechanism, combines the path scores with their own weights, indicating the importance of the paths in the target user-item interactions.

Since all the interactions in \(\mathcal {A}\) can be considered positive feedback or negative feedback (cf. Section 4.3), we regard our recommendation task as a binary classification task with 0 representing negative and 1 positive, similar to previous work [24]. We adopt cross-entropy loss to optimize our result. With the given interaction a, our loss function is defined as follows:

$$ \mathcal{L}=-\underset{a\in\mathcal{A}}{\sum}(y\log{\hat{y}}+(1-y)\log({1-\hat{y}})), $$
(13)

where \(\mathcal {A}=\{a_{1}, a_{2}, ...,a_{|\mathcal {A}|}\}\) is the whole user-item interaction set in the dataset and y and \(\hat {y}\) represent the observed user-item feedback and the predictive score of a, respectively. We also conduct L2 regularization (i.e., the ridge penalty) on the parameters of our PeRN, which is omitted here for simplicity.

5 Experiments

We perform various experiments on 2 real-world recommendation datasets, which are based on music recommendation and movie recommendation scenarios, to evaluate our PeRN with several state-of-the-art baselines. To more realistically construct a KG for recommendation and mine user preferences by path, we adopt two benchmark datasets, and the statistics are shown in Table 2.

  • KKBox Footnote 1, which is a music domain recommendation dataset from the WSDM Cup Challenge 2018 and is provided by the KKBox Music Streaming Service.

  • IM-1M, composed of Internet Movie Database (IMDb)Footnote 2 and MovieLens-1M Footnote 3 datasets, is a common movie recommendation dataset that has recently been widely used in KG-enhanced recommendation tasks [19, 24].

Table 2 Statistics of KKBox and IM-1M

To evaluate our PeRN more comprehensively and demonstrate its rationality, we use evaluation metrics from the perspectives of the binary classification recommendation task performance, the top-K recommendation task performance and the ability to handle cold-start issues, which are shown below:

  • Precision (P), recall (R), F1-score, and area under the curve (AUC): These five metrics are adopted to represent the overall accuracy of the biclassification task. Precision and recall are commonly used to address the imbalance of positive and negative samples. The F1-score is the harmonic mean of precision and recall. The AUC, the threshold of which we set in the range of 0.1 to 0.9, is calculated as in the following equation:

    $$ \text{AUC} = \frac{{\sum}_{i\in positive}(P+N+1-rank_{i})-\frac{P(1+P)}{2}}{P \times N}, $$
    (14)

    where P and N are the numbers of positive and negative interactions, respectively, and ranki denotes the rank of i’s predicted score.

  • Normalized discounted cumulative gain (NDCG@K): As the most common metric in the top-k recommendation task, the NDCG@K evaluates the model performance in terms of position influence, which is calculated as follows:

    $$ \text{NDCG@K} = \frac{1}{\alpha}\text{DCG@K}=\frac{1}{\alpha}\sum\limits_{i=1}^{K}\frac{2^{rel_{i}}-1}{\log_{2}(i+1)}, $$
    (15)

    where reli indicates the relevance of the recommendation at position i, K is the size of the target list, and α is a standardization constant that is the maximum value of the DCG@K. Considering that the sizes of these two datasets are different, we set K in KKBox and IM-1M as {3, 5, 10, 15} and {2, 6, 10, 12}, respectively.

  • Percentage of interactions used to complete the KG (PcKG): In measuring the cold-start cost, we set the PcKG in the range {10%, 20%, 30%, 40%, 50%}. Here, a lower PcKG with a higher score of the other metrics indicates a better ability to handle cold-start issues.

5.1 Experimental setup

The fundamental part of obtaining the KG-enhanced recommendation dataset is constructing the domain of the KG. First, the basic part of the KG can be selectively generated from the background information in the original dataset. Then, to integrate the user set into the KG, which makes finding paths between the users and items possible, a group of interactions should be added to the KG as compensation. KKBox not only provides a large amount of user-item interaction data with positive feedback 1 and negative feedback 0 but also several pieces of side information, such as the artist, lyricist, composer, and genre, which indicates that this music domain KG can be straightforwardly constructed. In IM-1M, the IMDb contains comprehensive auxiliary information such as the core contributors, duration, genre and budget of the movie, while MovieLens-1M provides more than 1,000,000 user-item interactions with rating scores {1, 2, 3, 4, 5}. Thus, a movie domain KG can be constructed by mapping the movie titles in IMDb and MovieLens-1M.

To better fit our proposed model, all the interactions in which the rating scores are 3 are omitted here to enable biclassification, and the other 4 scores are normalized for the top-K task. In the processed KKBox and IM-1M data, approximately 50% of the original interactions are treated as valid interactions (the exact numbers are shown in Table 2). The other part, which is used for completing the KG, can be used to measure the severity of the cold-start issue by the P ercentage of interactions used to c omplete the KG (PcKG). In terms of path data, we first provide our definition of the KG: the schema graphs of the music domain KG and movie domain KG that we designed are shown in Figure 4. In KKBox, U: user, I: item (song), L: language, Ar: artist, G: genre. In IM-1M, U: user, I: item (movie), Ac: actor, D: director, G: genre. In KKBox, we select the user, item (song), language, artist and genre as the target entity types. The relation “CooperatesWith” is extracted from the condition “multiple artists in one song”. In IM-1M, we choose the user, item (movie), director, actor and genre as the target entity types. Additionally, there are “actor_1_name” and “actor_2_name” attributes of a movie, and we regard the relation between them as “Co-starsWith”. Metapaths can be extracted by certain traversals in the schema graph, and paths are extracted by Algorithm 1.

Fig. 4
figure 4

Schema graphs of KKBox (left) and IM-1M (right)

Furthermore, we omit pretrained parameters in our proposed model to enable pair comparison with several baselines. To reasonably embed the paths into a vector space, we map the large fluctuating entity index to a 62-dimensional vector and concatenate it with the entity type index and relation type index. Therefore, the size of each input vector of the LSTM is 64, and each has a length of up to 7 (i.e., a 6-hop path). If the length of a path is less than 7, then it is completed with a zero vector. The batch size is 256 here. We adopt stochastic gradient descent (SGD) in the optimization step with a learning rate in the range {0.001, 0.01, 0.1, 0.5}, while the L2 regularization coefficients are tuned in the range {10− 5, 10− 4, 10− 3, 10− 2}. In our experiments, we compared our PeRN with the widely used recommendation methods below, which are based on matrix factorization (MF), factorization machines (FMs), KG translation, metapaths in a heterogeneous information network (HIN) and paths in a KG. The brief descriptions of these methods are as follows:

  • MF [16]: This is a standard matrix factorization method with Bayesian personalized ranking (BPR) loss that regards interactions as elements in the user-item matrix to predict unknown interactions.

  • AFM [28]: This method is a factorization model combined with an attention mechanism that excavates potential relations and interactions among user preferences and item features.

  • RippleNet [26]: The key of this method is preference propagation, which regards the historical interests of users as a seed set and propagates user preferences in the KG that contain extra information.

  • MEIRec [6]: This is a metapath-guided method that contains rich user-item information and interactions based on a HIN and utilizes structural information fully for intent recommendation.

  • KPRN [24]: KPRN is a representative method for integrating paths into KG recommendations through recurrent neural networks. Although KPRN achieves great performance in both binary classification and top-K recommendation tasks, it has the severe cold-start issue that its PcKG can reach 50%.

5.2 Performance evaluation

Table 3 and Figure 5 report our experimental results on binary classification recommendation and the ability to solve the cold-start issue in the top-K recommendation task, respectively. In Table 3, the bold numbers indicate the best result of each column. In Figure 5, ‘*’ indicates that the PcKG of KPRN is consistently 50%.

Table 3 Summary of the performance on the binary classification recommendation task for all baselines and our proposed PeRN on the KKBox and IM-1M datasets
Fig. 5
figure 5

Performance of PeRN in the top-K task and cold-start costs, measured by NDCG@{3, 5, 10, 15} in KKBox and NDCG@{2, 6, 10, 12} in IM-1M

For the biclassification task, regarding the binary classification recommendation issue, the main challenge is to achieve good performance on two user-item interaction matrices. In KKBox, the data density is only 0.0047%, so it is extremely sparse and renders various traditional recommendation methods unusable. MF and AFM are recommendation methods based on matrix operations and have obtained surprising results when processing highly sparse KKBox data. RippleNet achieves significant improvements over MF and AFM by integrating the KG into user-item interactions, but its predictive results are not as appreciable as those of the path-based MEIRec and KPRN considering the high sparseness when embedding a KG. Our proposed PeRN substantially outperforms the state-of-the-art methods and shows the best performance in several binary classification recommendation evaluation metrics. In IM-1M, as the data density is 1.87% here, PeRN and several baselines achieve slightly better results in terms of the P, R, and F1 values. In addition, the original rating scores in IM-1M are displayed in the normalized range {1, 2, 4, 5} rather than 1 for positive and 0 for negative, so the actual classification effects are not as excellent as expected, which is shown by a lower AUC score than for KKBox. Nonetheless, our proposed PeRN still has more promising results than the other baselines.

For the top-K task and cold-start costs, as Figure 5 illustrates, our proposed PeRN also has a better performance in alleviating the cold-start issue in the top-K recommendation task than the state-of-the-art path-based method KPRN. Here, KPRN utilizes 50% of the given target user-item interactions in both the IM-1M and KKBox datasets in the process of constructing the KG, which causes a severe cold-start issue. To demonstrate the ability of PeRN to handle cold-start issues in the top-K task, we adopt the PcKG, which is shown as a percentage, to evaluate our model against the baseline. Specifically, we use PcKG = {10%, 20%, 30%, 40%, 50%} of the interactions of each user in the original dataset, which aims to ensure that all the users can be found as entities in the KG to complete the music domain and movie domain KG. Every new completion has to re-extract all the paths to confirm the accuracy of the experiment. As measured by the NDCG@K – where K = {3, 5, 10, 15} in KKBox and K = {2, 6, 10, 12} in IM-1M considering that the size of the paths is different in each dataset – our PeRN is able to achieve a much more promising performance. In KKBox, with only a 20%-30% PcKG, it can reach a higher accuracy in the top-K recommendation task than KPRN with a 50% PcKG, while this number normally decreases to 10%-20% in IM-1M. In other words, we use only 30% of the prior knowledge on average and attain a better prediction result in the top-K recommendation task. Nevertheless, we must admit that these results should increase with a more thorough KG that remains to be designed and established. We hold a strong belief that reducing and refining prior knowledge is the key to decreasing cold-start costs.

5.3 Explainability analysis

In this section, we use a visualization example and a quantitative analysis to demonstrate PeRN’s effectiveness in enhancing the explainability.

Firstly, we randomly select a user-item interaction from the IM-1M dataset and visualize its three paths, as shown in Figure 6, to illustrate the model’s enhanced reasoning ability in a real scenario. The IDs of the user and target item are 3408 and 318, respectively. The red numbers denote the weighting score of each path. Some observations and analyses about “why user_3408 likes the movie The Shawshank Redemption” are presented below. As Figure 6 shows, three different paths between (U_id: 3408, I_id: 318) are clearly found, which can answer the question “why user_3408 likes item_318” at a shallow level. To mine more in-depth information from the paths and further differentiate the contribution of each path, we transform these three paths to metapaths by abstracting each entity to an entity type:

Clearly, these metapaths account for different user habits in the movie recommendation field. mp1 indicates that the user tends to show interest in movies with the same genre. mp2 tells us that the user may like movies with the same actors. More interestingly, we can also conclude that user_3408 may be a fan of Morgan Freeman by simple analysis of mp2. In addition, mp3 reveals that the user might have some interest in movies co-starring two famous actors. Such correlations between entity types assist us in acquiring more precise user preferences. After an overall statistical analysis by the entropy encoder and the learning process of our model, the weighting scores of the three paths are as shown in Figure 5. Specifically, the score of mp1 is slightly higher than that of mp2, while the score of mp3 is much lower than the other two. These results unambiguously explain that the events “the user likes a movie with the same actor” and “the user likes a movie of the same genre” are more common in movie recommendation. Additionally, the low weight of w3 demonstrates that the behavior “User Likes Item StarredBy Actor Co-starsWith Actor StarsIn Item” is fairly unusual compared with the other two, which is consistent with our behavioral intuitions in real life.

Fig. 6
figure 6

Visualization of the paths between selected user-item interactions

At last, we give an overall quantitative analysis about these weighting scores, as shown in Figure 7. We extract paths with the greatest weighting score in each user-item interaction. Here the number of the extracted paths is same as the number of user-item interactions. Every path here can be regarded as the most convincing explanation to the recommendation result. Evidently, in KKBox, the number of 3-hop paths are the greatest, which corresponds to a 3-hop path always contributes more to the result. This is a reasonable explanation as the closer relation is always more convincing. In IM-1M, the results are similar to those in KKBox. But number of 6-hop paths has a greater ratio, which indicates, compared to songs, people are more likely to like movies under 6-hop relations. All the above evidence demonstrates that our PeRN has a stronger explainability.

Fig. 7
figure 7

The ratios of the length of path with the greatest weight in user-item interactions. 1-hop and 2-hop paths here do not exist due to our schema graphs in Section 5.1

6 Conclusions

In this paper, a path-enhanced recurrent network (PeRN) is proposed, which shows good performance in dealing with cold-start issues and enhances the explanations of KG recommendation. We exploit information from the metapath to compensate for the shortcomings of general path-based methods and introduce a bidirectional path extraction algorithm, since path extraction is time-consuming. Extensive experiments show the well-developed explanations and effectiveness of our method. In the future, we hope to continue our work in the following research directions: (1) We hope to improve the current network model to boost its memory ability. In processing serialized data, such as the paths and metapaths in our PeRN, the memory ability of the model is just as crucial as its learning ability, which is always the most critical part in judging the quality of a model. Current researchers have paid too much attention to RNN-based models such as LSTM and GRU. We believe that memory networks, such as key-value memory networks, can also achieve excellent results, and we hope to leverage them to better remember the path information of KGs. (2) We hope to extract more information from the path. Although we analyze the relationships among different paths by using an entropy-based encoder in PeRN, we think there are more interesting correlations among different paths, such as triangular relationships, multiple relationships and relational reasoning in paths. We will mine this perspective further to improve the utilization of path information and to better excavate user preferences in KG recommendation. (3) In the past two years, research and applications based on graph neural networks (GNNs), such as RecoGCN [29] and GSN [9], have become a new focus. We plan to propose a new GNN-based method for explainable recommendation with KGs.