1 Introduction

Prior works [20, 45, 47] have shown that introducing knowledge graphs (KGs) into recommender systems (RS) can effectively improve recommendation accuracy and alleviate the problems of data sparsity and cold start, compared with traditional recommendation methods such as content-based methods [30] and collaborative filtering (CF)-based methods [25]. Besides, thanks to their rich side information, KGs have been successfully used in many intelligent tasks, such as question answering and information retrieval.

A KG is a kind of semantic network composed of entities and relations. Its basic unit is a triple (h, r, t), where h and t represent the head entity and the tail entity, respectively, and r represents the relation between h and t. For example, the two triples \(<Donald\ Trump, occupation, Politician>\) and \(<Donald\ Trump, occupation, Investor>\) state that Donald Trump is not only a politician but also an investor. The main idea of the existing KG-aware recommendation methods [51, 55, 56] is to build a User-Item-Entity KG (UIEKG) by connecting the user’s interacted items with the entities or attributes extracted from a side KG (such as Wikidata [49], DBpedia [2], Yago [44], and Satori [37]), and then obtain the user preference representation by propagating the information of the entities or attributes in the UIEKG. For example, KGAT [51] updates the user preference vector by aggregating the embeddings of the neighbors, and recursively performs this updating process to capture high-order neighbor information; meanwhile, an attention mechanism weights the importance of the neighbors. However, directly aggregating or propagating the information of entities, instead of processing it further, leaves useful information barely exploited, such as the heterogeneity information [4] that can be derived from different item clusters or entity clusters.

Exploiting this heterogeneity could enhance the ability of the recommendation model to accurately capture the user’s fine-grained tastes. For example, as shown in Figure 1, given a user’s viewing record, a movie-related UIEKG can be built by connecting the movies to the actors, directors, and genres extracted from Satori. We group the items (movies) into \(Item\ Cluster_{science}\) and \(Item\ Cluster_{thriller}\) according to their genres, which gives these two item clusters item-level heterogeneity in terms of movie genre. If \(Item\ Cluster_{thriller}\) receives more attention than \(Item\ Cluster_{science}\) when this item-level heterogeneity is encoded into the user preference, the user probably prefers thrillers to science fiction. Similarly, we could group the entities headed by Avengers: Endgame into \(Entity\ Cluster_{actor}\) and \(Entity\ Cluster_{director}\) according to their relations, which gives the two entity clusters entity-level heterogeneity in terms of the relation between head entity and tail entity. If \(Entity\ Cluster_{actor}\) receives more attention than \(Entity\ Cluster_{director}\) when this entity-level heterogeneity is encoded into the user preference, the user’s decision to watch Avengers: Endgame probably depended largely on who acted in the movie rather than who directed it. Based on this case analysis, we hypothesize that encoding the multi-level (item-level and entity-level) heterogeneity derived by multistage clustering can make the user preference representation more focused.

Fig. 1: A toy example of the movie-related User-Item-Entity Knowledge Graph (UIEKG). The entities or attributes are extracted from Microsoft Satori

Based on the above hypothesis, we propose a Multistage Clustering-based Hierarchical Attention (McHa) model to obtain the user preference representation for knowledge graph-aware recommendation. Specifically, we first group the items and their neighboring entities in the UIEKG into item clusters and entity clusters (jointly referred to as multistage clusters) according to the attributes of items (e.g., genres of movies or authors of books) and the relations between head entities and tail entities, respectively. Then, we construct hierarchical attention layers to discriminatively aggregate the multi-level heterogeneity information derived from the multistage clusters into the user preference. Intuitively, our model can produce a more focused user preference representation based on the following distinctive designs: 1) multistage clustering produces the multi-level heterogeneity information of the items and their neighboring entities in the UIEKG, which encodes the user’s interaction intention toward characteristic items; 2) the hierarchical attention layers, built by integrating attention mechanisms [61] with graph attention networks (GAT) [48], discriminate the importance of each cluster and its elements; and 3) we explicitly encode the relation embedding into the entity cluster representation to enhance the heterogeneity of different entity clusters in terms of the triple’s relation. Our contributions can be summarized as follows:

  • We propose a novel knowledge graph-aware recommendation model, namely McHa, to obtain the fine-grained user preference representation strengthened with the multi-level heterogeneity derived by grouping the items and their neighboring entities into multistage clusters.

  • We construct the hierarchical attention layers by integrating multi-level attention mechanisms with GAT to discriminate the contribution of each cluster to the user preference representation.

  • We demonstrate the effectiveness of our model and the positive effect of each component of McHa through comparative experiments with state-of-the-art baselines and an ablation analysis of its variants, respectively, on three benchmark datasets in two scenarios.

The remainder of this paper is organized as follows. In Section 2, we survey the related works on KG-aware recommendation as well as recently emerging topics in recommender systems. In Section 3, we present our model in detail. In Section 4, we report our experiments and analyze the results. Finally, we conclude our work and discuss future directions in Section 5.

2 Related work

In this section, we first survey the literature related to KG-aware recommendation. Following [16], we divide KG-aware recommendation methods into three categories: embedding-based, path-based, and unified methods. We then discuss recently emerging research topics in recommender systems.

2.1 Knowledge graph aware recommendation

2.1.1 Embedding-based methods

The embedding-based methods [1, 19, 36, 54] generally embed the semantic information of the KG into the representations of items or users. For example, CKE [64] leverages heterogeneous network embedding and deep learning embedding approaches to automatically extract semantic representations from multi-modal knowledge. It then combines collaborative filtering and knowledge embedding components into a unified framework and learns the different representations jointly. MKR [50] builds several cross and compress units, which automatically share latent features and learn high-order interactions between items in recommender systems and entities in KGs. However, embedding-based methods ignore the connectivity in KGs, which makes it difficult to explain the recommendation results.

2.1.2 Path-based methods

Path-based methods mainly enhance the recommendation model by exploring the connectivity in KGs [8, 21, 58]. For instance, HeteRec [62] uses meta-path-based latent features to represent the connectivity between users and items along different paths; a recommendation model with such latent features is then defined and optimized through Bayesian ranking optimization techniques. Later, FMG [65] improved the accuracy of recommendation by replacing the meta-path with the meta-graph. Moreover, to discriminate the importance of different paths, MCRec [18] learns representations for users, items, and the meta-paths extracted through priority-based sampling, and applies a co-attention mechanism to balance the meta-paths and user-item pairs so that their representations mutually improve. RuleRec [33] induces rules from KGs for items and then makes recommendations based on the induced rules. Generally, path-based methods calculate path-level similarity for items and entities by encoding predefined paths or meta-paths. However, extracting such paths is a time-consuming and expertise-intensive process.

2.1.3 Unified methods

To fully exploit the information in the KG, unified methods [26, 38, 42, 66] have been proposed to integrate the semantic and connectivity information [46]. For example, RippleNet [55], inspired by the propagation of water waves, propagates user preference over the set of KG entities by automatically and iteratively extending the user’s potential interests along the relations. KGCN [56] is an end-to-end framework that discovers both the high-order structural information and the semantic information of the KG, and considers neighborhood information when calculating the representation of a given entity. KGAT [51] propagates embeddings from a node’s high-order neighbors to the central node and employs an attention mechanism to discriminate the importance of neighbors. Recently, MVIN [46] was proposed to learn the item representation from both the user view and the entity view through a novel wide and deep GCN. Unified methods have become a popular way to fully exploit the information in KGs [23, 28].

2.1.4 Summary

Through investigating the related works on KG-aware recommendation, we find that the multi-level heterogeneity hidden in the items and their neighboring entities, which preserves the user’s fine-grained interests, remains barely explored by existing methods. To fill this gap, we propose McHa to provide new insight into exploiting more of the information in the KG. To the best of our knowledge, McHa is the first method to exploit multi-level heterogeneity information by aggregating the representations of multistage clusters and their elements with hierarchical attention layers.

2.2 Other topics of recommendation

2.2.1 Community detection for recommendation

Community detection aims to discover subgraphs of a network whose nodes share similar characteristics and patterns [29, 32]. It has been applied to many tasks, such as recommender systems, biochemistry, and online social network analysis [43]. In recommender systems, users with similar interests or preferences can be treated as members of a community. Detecting heterogeneous communities can help recommender systems capture users’ differentiated preferences and thus provide personalized recommendations. For example, Eissa et al. [10] propose a recommendation model based on interest-based communities generated from topic-based attributed social networks. SimClusters [40] is a recommendation algorithm based on the bipartite communities detected via Metropolis-Hastings sampling. Recently, LA-ALS [35] was proposed based on the Louvain community detection algorithm and the alternating least squares algorithm; specifically, Louvain community detection is used to recognize the relationships between users to strengthen the recommendation model.

2.2.2 Explainable recommendation

Explainable recommendations [33, 52, 59] have attracted increasing attention as they can improve the persuasiveness of recommendation results. Advances in KGs have made it possible to provide explainable recommendations by integrating graph embedding learning with recommendation techniques [9]. Within this field, KPRN [52], PGPR [58], and PeRN [22] reason over the paths extracted from KGs to improve the causal inference of recommendations with interpretability. Further, Xie et al. [60] design a novel multi-objective optimization function to jointly optimize the precision, diversity, and explainability of recommendations. Besides, some researchers have tried to derive interpretability from auxiliary information, such as attributes [5], aspects [17], and sentiment [63]. For example, AMCF [60] incorporates a novel feature mapping approach to map uninterpretable general features onto interpretable aspect features. Another important line of research introduces attention mechanisms into RS to explore the interpretability reflected by the discriminative attention weights. For example, to provide explanations tailored to different target items, Seo et al. [41] and Chen et al. [6] adopt attention mechanisms to derive the importance of different review sentences under the supervision of user-item rating information.

2.2.3 Fairness in recommendation

Recently, research on fair recommendation has drawn growing interest, with several efforts [12, 15] on alleviating the unfairness problem of RS. For example, Fu et al. [11] quantify unfairness in terms of KG path diversity as well as the disparity in recommendation performance, and then propose a fairness-aware algorithm to produce high-quality explainable recommendations with fairness. Mansoury et al. [34] propose a graph-based algorithm, namely FairMatch, for improving recommendation fairness; it updates the recommendation lists with items that are rarely recommended yet of high quality. However, as item popularity and user engagement change, such fairness-aware methods cannot cope with the dynamic fairness problem. To address this limitation, Ge et al. [14] propose FCPO to capture long-term dynamic fairness through a fairness-constrained reinforcement learning framework. In detail, they leverage Constrained Policy Optimization (CPO) with an adapted neural network architecture to automatically learn the optimal policy under different fairness constraints.

Fig. 2: Graphical depiction of (1) the overview of the KG-aware recommendation task (left part) and (2) the framework of McHa (right part). The abbreviations in this figure are as follows: KGEL is the knowledge graph embedding layer, EAL is the entity-level attention layer, REL is the relation enhancing layer, ECAL is the entity cluster-level attention layer, IAL is the item-level attention layer, and ICAL is the item cluster-level attention layer. \(\mathbf {e}_{t_1}\) stands for the embedding vector of tail entity \(t_1\). \(\mathbf {e}_{{EC}_{r_2}^{v_2}}\) is the representation vector of the entity cluster in which the tail entities share the same head entity (item) \(v_2\) and relation \(r_2\). \(\mathbf {e}_{v_2}\) is the representation vector of item \(v_2\). \(\mathbf {e}_{{IC}_{a_2}^{u}}\) denotes the representation vector of the item cluster in which the items have been interacted with by the same user u and share the same attribute \(a_2\). \(\mathbf {u}\) represents the final user preference vector

3 Methodology

3.1 Problem formulation

In knowledge graph-aware recommendation, we let \(\mathcal {U}=\{u_1, u_2, \cdots ,u_{|\mathcal {U}|} \}\) and \(\mathcal {V}=\{v_1,v_2, \cdots ,v_{|\mathcal {V}|}\}\) denote the user set and item set, respectively. The user-item interaction matrix is represented as \(\mathcal {Y}=\{y_{uv}|u \in \mathcal {U},v \in \mathcal {V}\}\), where

$$\begin{aligned} y_{uv}=\left\{ \begin{array}{ll} 1, & \text {if } u \text { has an interaction with } v;\\ 0, & \text {otherwise}. \end{array} \right. \end{aligned}$$
(1)

\(y_{uv}=1\) means that the user u has an implicit interaction with the item v, such as clicking, watching, or browsing. Additionally, we have a side knowledge graph \(\mathcal {KG}\), which is comprised of triples (h, r, t), where h, r, and t represent the head entity, relation, and tail entity, respectively. Given an input user u, a candidate item v, the user-item interaction matrix \(\mathcal {Y}\), and the knowledge graph \(\mathcal {KG}\), the goal of our model is to train a prediction model \(\hat{y}_{uv} = \mathcal {F}(u,v)\) that predicts the probability \(\hat{y}_{uv}\) that the user u would adopt the candidate item v.

In detail, as shown in the left part of Figure 2, for each input user u \(\in\) \(\mathcal {U}\), we can obtain the interaction record of user u by looking up the user-item interaction matrix \(\mathcal {Y}\). The interaction record \(\mathcal {I}\) can be formulated as

$$\begin{aligned} \mathcal {I}=\{v_1,\cdots , v_i,\cdots ,v_{|\mathcal {I}|} \},\ v_i \in \mathcal {V}\ \text {and}\ y_{uv_i}=1. \end{aligned}$$
(2)

We link all items in the interaction record \(\mathcal {I}\) to the entities or attributes of the side KG to generate UIEKG \(\mathcal {G}\). Then, the UIEKG is fed into the user preference capturing model to calculate the final preference representation (denoted by \(\mathbf {u}\)) for the input user u. Accordingly, we could feed the input candidate item v into the knowledge graph embedding layer to obtain the candidate item representation (denoted by \(\mathbf {v}\)). After that, we calculate the probability \(\hat{y}_{uv}\) by inputting \(\mathbf {u}\) and \(\mathbf {v}\) into a mapping function \(f: \mathbb {R}^k \times \mathbb {R}^k \rightarrow \mathbb {R}\):

$$\begin{aligned} \hat{y}_{uv} = f(\mathbf {u}, \mathbf {v}). \end{aligned}$$
(3)

Capturing user preference is the most important part of knowledge graph-aware recommender systems. An ideal recommendation model should capture the user’s potential interests as accurately as possible. To achieve this, we propose a new KG-aware recommendation model, namely McHa (depicted in the right part of Figure 2). Unlike existing KG-aware recommendation methods, which directly aggregate neighboring entities into the central item and aggregate items into the user, we additionally deploy an entity cluster-level attention layer between the neighboring entities and the central item, and an item cluster-level attention layer between the interacted items and the user, to capture the user’s more fine-grained potential interests. We present our model in detail in the following Sections 3.2-3.8.

3.2 Knowledge graph embedding layer (KGEL)

As shown in Figure 2, the knowledge graph embedding (KGE) layer represents entities and relations as vectors that preserve the structural and semantic information in the KG. Many approaches have been proposed for KGE, such as TransE [3], TransH [53], and TransR [27]. In our work, we use TransR to learn the embeddings of the entities and relations in the KG because of its superiority in handling the multi-relational space projection between head and tail entities. Notationally, we let \(\mathbf {e}_h\), \(\mathbf {e}_r\), \(\mathbf {e}_t\) denote the embeddings of h, r, and t for a triple (h, r, t) in the UIEKG, respectively. The embeddings \(\mathbf {e}_h\) and \(\mathbf {e}_t\) in the entity space are projected into the relation space by the r-aware parameter \(\mathbf {W}_r\):

$$\begin{aligned} \mathbf {e}_h^r = \mathbf {W}_r \mathbf {e}_h,\ \ \mathbf {e}_t^r = \mathbf {W}_r \mathbf {e}_t, \end{aligned}$$
(4)

where \(\mathbf {e}_h\), \(\mathbf {e}_t\) \(\in\) \(\mathbb {R}^d\), \(\mathbf {e}_r\) \(\in\) \(\mathbb {R}^{d_r}\), and \(\mathbf {W}_r\) \(\in\) \(\mathbb {R}^{d_r \times d}\). According to the principle of TransR, we have “\(h + r \approx t\)”, which means that h can be translated into t through the bridge r. Therefore, the energy score of the triple (h, r, t) can be evaluated by

$$\begin{aligned} f(h,r,t) = \left\| \mathbf {e}_h^r + \mathbf {e}_r - \mathbf {e}_t^r \right\| _2^2. \end{aligned}$$
(5)

A lower score f(h, r, t) means that the head entity and the tail entity are closer in the relation space. It should be noted that the items in the UIEKG are treated as entities when performing knowledge graph embedding.
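For concreteness, the following PyTorch-style sketch implements the TransR projection and energy score of (4)-(5). It is a minimal illustration under our own batching, initialization, and default-dimension assumptions, not a definitive implementation.

```python
import torch
import torch.nn as nn

class TransR(nn.Module):
    """Minimal sketch of the TransR projection and energy score (Eqs. 4-5)."""

    def __init__(self, n_entities, n_relations, d=32, d_r=32):
        super().__init__()
        self.ent = nn.Embedding(n_entities, d)     # entity embeddings e_h, e_t
        self.rel = nn.Embedding(n_relations, d_r)  # relation embeddings e_r
        # One relation-aware projection matrix W_r of shape (d_r, d) per relation.
        self.W_r = nn.Parameter(torch.randn(n_relations, d_r, d) * 0.1)

    def energy(self, h, r, t):
        """f(h, r, t) = ||W_r e_h + e_r - W_r e_t||_2^2 for id batches h, r, t."""
        W = self.W_r[r]                                            # (B, d_r, d)
        e_h = torch.bmm(W, self.ent(h).unsqueeze(-1)).squeeze(-1)  # Eq. (4)
        e_t = torch.bmm(W, self.ent(t).unsqueeze(-1)).squeeze(-1)  # Eq. (4)
        return ((e_h + self.rel(r) - e_t) ** 2).sum(dim=-1)        # Eq. (5)
```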

3.3 Entity-level attention layer (EAL)

3.3.1 Entity cluster extraction

As shown in Figure 2, given an interacted item v \(\in\) \(\mathcal {I}\), we can extract from the UIEKG the triples whose head entity is the item v. The extracted triples share the same head entity but may have different relations. To exploit the heterogeneity information in terms of the triple’s relation, we group the tail entities that share the same item (head entity) into several entity clusters (e.g., \(Entity\ Cluster_{actor}\) and \(Entity\ Cluster_{director}\)) according to their relations. An entity cluster is defined as:

Definition 1

Entity Cluster (\({EC^v_r}\)): A group of entities that share the same head entity v under relation r:

$$\begin{aligned} {EC^v_{r}} = \{ t_1, t_2, t_3,\cdots , t_{|{EC^v_r}|}\}. \end{aligned}$$
(6)

Grouping entities according to their relations in this layer allows our model to purposefully capture the user’s preference for items with more subdivided characteristics.
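The extraction itself reduces to grouping triples by relation. A minimal sketch, assuming the UIEKG triples are available as (h, r, t) id tuples:

```python
from collections import defaultdict

def extract_entity_clusters(triples, item_v):
    """Group the tail entities headed by item_v into entity clusters keyed
    by relation (Definition 1)."""
    clusters = defaultdict(list)  # relation r -> EC^v_r
    for h, r, t in triples:
        if h == item_v:
            clusters[r].append(t)
    return clusters

# Item clusters (Definition 2, Section 3.6) can be built analogously,
# keyed by item attribute instead of relation.
```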

3.3.2 Obtain entity cluster representation

After extracting the entity clusters, we obtain each entity cluster representation by aggregating the elements in the cluster with the entity-level attention weights generated by GAT [48]. Specifically, we first obtain the attention score s(t) for each element by

$$\begin{aligned} s(t) = \mathrm{LeakyReLU}(\mathbf {W}_{EAL_2} \cdot [\mathbf {W}_{EAL_1}\mathbf {e}_h||\mathbf {W}_{EAL_1}\mathbf {e}_t]), \end{aligned}$$
(7)

where \(\mathbf {W}_{EAL_1} \in \mathbb {R}^{d_r \times d}\) and \(\mathbf {W}_{EAL_2} \in \mathbb {R}^{1 \times 2d_r}\) are learning parameters for feature augmentation and \([\cdot ||\cdot ]\) denotes the concatenation of two vectors. A single-layer perceptron with the LeakyReLU activation function maps the latent vector \([\mathbf {W}_{EAL_1}\mathbf {e}_h||\mathbf {W}_{EAL_1}\mathbf {e}_t]\) to the real number s(t). We chose the LeakyReLU activation function since it mitigates the “dying ReLU” problem [31] by using a small negative slope instead of zero for negative inputs. By normalizing the attention score s(t) via the softmax function, we get the attention weight:

$$\begin{aligned} \alpha (t) =\frac{{\exp }(s(t))}{\sum _{t \in {EC^v_r}}\exp (s(t))}. \end{aligned}$$
(8)

The entity-level attention weight \(\alpha (t)\) indicates which neighboring tail entities should be paid more attention when capturing the collaborative information. Finally, we obtain the entity cluster representation by aggregating the embedding vectors of all elements in the entity cluster \(EC^v_{r}\):

$$\begin{aligned} \mathbf {e}_{{EC^v_{r}}} =\sum _{t \in {EC^v_r}}\alpha (t)\mathbf {W}_{EAL_1}\mathbf {e}_{t}. \end{aligned}$$
(9)

\(\mathbf {e}_{{EC^v_{r}}}\) is the final entity cluster representation, which preserves the heterogeneity information in terms of the triple’s relation r.
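A sketch of this layer following (7)-(9) is given below; single-cluster (unbatched) processing is our simplifying assumption for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EntityLevelAttention(nn.Module):
    """Sketch of the entity-level attention of Eqs. (7)-(9) for one cluster."""

    def __init__(self, d, d_r):
        super().__init__()
        self.W1 = nn.Linear(d, d_r, bias=False)      # W_{EAL_1}
        self.W2 = nn.Linear(2 * d_r, 1, bias=False)  # W_{EAL_2}

    def forward(self, e_h, e_tails):
        # e_h: (d,) embedding of the head item; e_tails: (n, d) tail embeddings.
        h = self.W1(e_h).expand(e_tails.size(0), -1)          # (n, d_r)
        t = self.W1(e_tails)                                  # (n, d_r)
        s = F.leaky_relu(self.W2(torch.cat([h, t], dim=-1)))  # Eq. (7)
        alpha = torch.softmax(s, dim=0)                       # Eq. (8)
        return (alpha * t).sum(dim=0)                         # Eq. (9): e_{EC^v_r}
```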

3.4 Relation enhancing layer (REL)

According to Definition 1, the entities headed by item v can be grouped into different entity clusters according to their relations, which gives each entity cluster a different heterogeneity in terms of the triple’s relation. To further enhance the heterogeneity of each entity cluster, we explicitly encode the embedding of relation r into the entity cluster representation \(\mathbf {e}_{{EC^v_r}}\). As shown in Figure 2, the relation enhancing process can be formulated as

$$\begin{aligned} \mathbf {e}'_{{EC^v_r}} = \mathbf {e}_{{EC^v_r}} \odot \mathbf {e}_r, \end{aligned}$$
(10)
$$\begin{aligned} \mathbf {e}^*_{{EC^v_{r}}} = \sigma (\mathbf {W}_{REL} \mathbf {e}'_{{EC^v_r}}). \end{aligned}$$
(11)

We first obtain the latent representation \(\mathbf {e}'_{{EC^v_r}} \in \mathbb {R}^{d_r}\) by taking the element-wise product of the entity cluster representation \(\mathbf {e}_{{EC^v_r}}\) and the relation embedding \(\mathbf {e}_r\). Then, we use a fully connected layer with the sigmoid activation function to compress the latent representation \(\mathbf {e}'_{{EC^v_r}}\) into \(\mathbf {e}^*_{{EC^v_r}}\). \(\mathbf {W}_{REL} \in \mathbb {R}^{d \times d_r}\) is a learning parameter. \(\mathbf {e}^*_{{EC^v_r}} \in \mathbb {R}^d\) is the final entity cluster representation, enhanced with the relation information.
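A corresponding sketch of (10)-(11), under the dimension convention above (\(\mathbf {W}_{REL}\) mapping \(\mathbb {R}^{d_r}\) to \(\mathbb {R}^{d}\)):

```python
import torch
import torch.nn as nn

class RelationEnhancingLayer(nn.Module):
    """Sketch of Eqs. (10)-(11): inject the relation embedding into the
    entity cluster representation via an element-wise product."""

    def __init__(self, d, d_r):
        super().__init__()
        self.W = nn.Linear(d_r, d, bias=False)  # W_{REL}: R^{d_r} -> R^{d}

    def forward(self, e_cluster, e_r):
        # e_cluster, e_r: (d_r,) cluster representation and relation embedding.
        e_prime = e_cluster * e_r               # Eq. (10)
        return torch.sigmoid(self.W(e_prime))   # Eq. (11): e*_{EC^v_r} in R^d
```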

3.5 Entity cluster-level attention layer (ECAL)

As depicted in the right part of Figure 2, the entities headed by item v can be grouped into several entity clusters according to their relations, which can be formulated as

$$\begin{aligned} S_{v} = \{ EC^v_{r_1}, EC^v_{r_2}, EC^v_{r_3},\cdots ,EC^v_{r_{|S_{v}|}} \}, \end{aligned}$$
(12)

where \(S_{v}\) is the entity cluster set of item v. Not all entity clusters contribute equally to the central item representation. For example, if the user’s desire to watch a movie largely depends on who acted in the movie rather than who directed it, the entity cluster \(Entity\ Cluster_{actor}\) should be paid more attention than \(Entity\ Cluster_{director}\) when capturing the user’s preference.

Motivated by the above analysis, we calculate the representation of item v by discriminatively aggregating the representations of all entity clusters of item v. In detail, we apply the entity cluster-level attention mechanism to discriminate between informative and uninformative entity clusters, which can be formulated as

$$\begin{aligned} \mathbf {e}_{v} = \sum _{{EC^v_r} \in S_{v}}\alpha ({EC^v_r})\mathbf {e}^*_{{EC^v_r}}, \end{aligned}$$
(13)
$$\begin{aligned} \alpha ({EC^v_r}) = \frac{\exp (\mathbf {s}({EC^v_r})^\mathrm {T} \mathbf {s}_{ECAL})}{\sum _{{EC^v_r} \in S_{v}}\exp (\mathbf {s}({EC^v_r})^\mathrm {T} \mathbf {s}_{ECAL})}, \end{aligned}$$
(14)
$$\begin{aligned} \mathbf {s}({EC^v_r}) = \mathrm{tanh}(\mathbf {W}_{ECAL} \mathbf {e}^*_{{EC^v_r}} + \mathbf {b}_{ECAL}). \end{aligned}$$
(15)

Inspired by [61], we first utilize a single-layer feedforward neural network with the tanh activation function to calculate the hidden representation \(\mathbf {s}({EC^v_r})\) of the entity cluster \(EC^v_r\), where \(\mathbf {W}_{ECAL} \in \mathbb {R}^{d' \times d}\) and \(\mathbf {b}_{ECAL} \in \mathbb {R}^{d'}\) are the learning weight matrix and bias, respectively. We chose the tanh activation function since it avoids the non-zero-centered problem of the sigmoid function by squashing a real value to the range [-1, 1]. Then, the entity cluster-level attention weight \(\alpha ({EC^v_r})\) is calculated by normalizing the inner product of \(\mathbf {s}({EC^v_r})\) and \(\mathbf {s}_{ECAL} \in \mathbb {R}^{d'}\) via the softmax function; \(\mathbf {s}_{ECAL}\) can be regarded as the entity cluster-level context vector. Finally, we calculate the item representation \(\mathbf {e}_{v}\) by aggregating all \(\mathbf {e}^*_{{EC^v_r}}\) in \(S_{v}\) with the entity cluster-level attention weights. The vector \(\mathbf {e}_{v}\) is the final representation of item v, summarizing the information of all entity clusters whose head entity is this item.
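The following sketch illustrates this context-vector attention pooling of (13)-(15); the initialization of the context vector is our assumption.

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Sketch of the context-vector attention of Eqs. (13)-(15)."""

    def __init__(self, d, d_ctx):
        super().__init__()
        self.proj = nn.Linear(d, d_ctx)                  # W_{ECAL}, b_{ECAL}
        self.context = nn.Parameter(torch.randn(d_ctx))  # s_{ECAL}

    def forward(self, elements):
        # elements: (n, d), e.g., the relation-enhanced cluster vectors e*.
        s = torch.tanh(self.proj(elements))                  # Eq. (15)
        alpha = torch.softmax(s @ self.context, dim=0)       # Eq. (14)
        return (alpha.unsqueeze(-1) * elements).sum(dim=0)   # Eq. (13)
```

Since (17)-(19) and (21)-(23) below share this form, the same module, instantiated with separate parameters per layer, can also realize the IAL and ICAL layers of Sections 3.6 and 3.7.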

3.6 Item-level attention layer (IAL)

3.6.1 Item cluster extraction

Similar to the entity cluster extraction, we group the items in the user’s interaction record \(\mathcal {I}\) into different item clusters according to their attributes (e.g., genres for movies). An item cluster is defined as:

Definition 2

Item Cluster (\({IC^u_a}\)): A group of items that share the same user u and attribute a:

$$\begin{aligned} {IC^u_a} = \{ v_1, v_2, v_3,\cdots , v_{|{IC^u_a}|}\}. \end{aligned}$$
(16)

The purpose of grouping items into different item clusters in this layer is to allow our model to exploit the heterogeneity of different item clusters in terms of the item’s attribute and strengthen the pertinence of the user preference.

3.6.2 Obtain item cluster representation

As presented in the right part of Figure 2, to obtain the representation of item cluster \({IC^u_a}\), we aggregate the representations of all items in \({IC^u_a}\) based on the item-level attention mechanism, which can be formulated as

$$\begin{aligned} \mathbf {e}_{{IC^u_{a}}} = \sum \limits _{v \in {IC^u_a}}\alpha (v)\mathbf {e}_{v}, \end{aligned}$$
(17)
$$\begin{aligned} \alpha (v) = \frac{\exp (\mathbf {s}(v)^\mathrm {T} \mathbf {s}_{IAL})}{\sum _{v \in {IC^u_a}}\exp (\mathbf {s}(v)^\mathrm {T} \mathbf {s}_{IAL})}, \end{aligned}$$
(18)
$$\begin{aligned} \mathbf {s}(v) = \mathrm{tanh}(\mathbf {W}_{IAL} \mathbf {e}_{v} + \mathbf {b}_{IAL}), \end{aligned}$$
(19)

where \(\mathbf {W}_{IAL} \in \mathbb {R}^{d' \times d}\), \(\mathbf {b}_{IAL} \in \mathbb {R}^{d'}\) and \(\mathbf {s}_{IAL} \in \mathbb {R}^{d'}\) are the learning parameters. \(\alpha (v)\) is the item-level attention weight. The vector \(\mathbf {e}_{{IC^u_a}}\) is the item cluster representation that summarizes the information of all items in the item cluster \({IC^u_a}\).

3.7 Item cluster-level attention layer (ICAL)

In this layer, the interaction items of the user u can be grouped into different item clusters, which can be formulated as

$$\begin{aligned} S_{u} = \{ IC^u_{a_1}, IC^u_{a_2}, IC^u_{a_3},\cdots ,IC^u_{a_{|S_{u}|}} \}, \end{aligned}$$
(20)

where \(S_{u}\) is the item cluster set of user u. As discussed in Section 1, a user may have different degrees of interest in different item clusters. Therefore, the user preference representation \(\mathbf {u}\) is obtained by discriminatively aggregating the representations of all item clusters of user u based on the item cluster-level attention mechanism, which can be formulated as

$$\begin{aligned} \mathbf {u} = \sum _{{IC^u_a} \in S_{u}}\alpha ({IC^u_a})\mathbf {e}_{{IC^u_a}}, \end{aligned}$$
(21)
$$\begin{aligned} \alpha ({IC^u_a}) = \frac{\exp (\mathbf {s}({IC^u_a})^\mathrm {T} \mathbf {s}_{ICAL})}{\sum _{{IC^u_a} \in S_{u}}\exp (\mathbf {s}({IC^u_a})^\mathrm {T} \mathbf {s}_{ICAL})}, \end{aligned}$$
(22)
$$\begin{aligned} \mathbf {s}({IC^u_a}) = \mathrm{tanh}(\mathbf {W}_{ICAL} \mathbf {e}_{{IC^u_a}} + \mathbf {b}_{ICAL}). \end{aligned}$$
(23)

where \(\mathbf {W}_{ICAL} \in \mathbb {R}^{d' \times d}\), \(\mathbf {b}_{ICAL} \in \mathbb {R}^{d'}\) and \(\mathbf {s}_{ICAL} \in \mathbb {R}^{d'}\) are the learning parameters. \(\alpha ({IC^u_a})\) is the item cluster-level attention weight. The vector \(\mathbf {u}\) is the final user preference representation.

3.8 Probability prediction

So far, we have obtained the final user preference representation \(\mathbf {u}\) of user u. Given a candidate item v, we feed it into the knowledge graph embedding layer to obtain the candidate item representation \(\mathbf {v}\). Then, the probability \(\hat{y}_{uv}\) that user u would adopt candidate item v is calculated by feeding \(\mathbf {u}\) and \(\mathbf {v}\) into the following equation:

$$\begin{aligned} \hat{y}_{uv}=\sigma (\mathbf {u}^ \mathrm { T } \mathbf {v}), \end{aligned}$$
(24)

where \(\sigma (\cdot )\) is the sigmoid function. \(\hat{y}_{uv}\) is the final output of our model.

3.9 Learning algorithm

3.9.1 Loss function

In the training process of knowledge graph embedding (KGE), we learn the embeddings of entities and relations in UIEKG \(\mathcal {G}\) by optimizing the BPR [39] loss with \(L_2\) regularization, which can be formulated as

$$\begin{aligned} {\mathcal {L}}_\mathrm{KGE} = \sum _{(h,r,t) \in \mathcal {G},\, (h,r,t') \notin \mathcal {G}} -\ln \sigma (f(h,r,t') - f(h,r,t)) + \lambda \left\| \varTheta _\mathrm{KGE} \right\| ^2_2. \end{aligned}$$
(25)

\({\mathcal {L}}_\mathrm{KGE}\) is the knowledge graph embedding loss. In detail, the first term is the BPR loss, where \((h,r,t')\) is a negative triple generated by negative sampling on the tail entity, and \(f(\cdot )\) (see (5)) is the energy function that evaluates the plausibility of a triple. The second term is the \(L_2\) regularizer with coefficient \(\lambda\) for preventing overfitting, and \(\varTheta _\mathrm{KGE}\) (including \(\mathbf {W}_r\)) stands for the parameter set for training the KGE. \(\mathcal {G}\) stands for the user-item-entity knowledge graph.
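A sketch of (25), assuming the TransR module sketched in Section 3.2 exposes the energy function f as `model.energy`; the summed-parameter regularizer and the value of lambda are illustrative.

```python
import torch.nn.functional as F

def kge_bpr_loss(model, h, r, t, t_neg, lam=1e-5):
    """Sketch of Eq. (25): BPR loss over (positive, corrupted) triple pairs
    with L2 regularization."""
    pos = model.energy(h, r, t)           # f(h, r, t)
    neg = model.energy(h, r, t_neg)       # f(h, r, t') with corrupted tail
    bpr = -F.logsigmoid(neg - pos).sum()  # lower energy for true triples
    l2 = sum((p ** 2).sum() for p in model.parameters())
    return bpr + lam * l2
```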

In the training process of recommendation model (RM), we adopt the cross-entropy loss with \(L_2\) regularization to optimize the learning parameters, which can be formulated as

$$\begin{aligned} {\mathcal {L}}_\mathrm{RM} = -\sum _{(u, v) \in \mathcal {P}}(y_{uv}\log (\hat{y}_{uv})+(1-y_{uv})\log (1-\hat{y}_{uv})) + \lambda \left\| \varTheta _\mathrm{RM} \right\| ^2_2. \end{aligned}$$
(26)

\({\mathcal {L}}_\mathrm{RM}\) is the recommendation loss. In detail, the first term is the cross-entropy loss, where \(\mathcal {P}\) stands for the mixed training set, including the observed interactions and the unobserved (negative) interactions generated by the negative sampling strategy, and \(\hat{y}_{uv}\) (see (24)) is the CTR probability. The second term is the \(L_2\) regularizer with coefficient \(\lambda\), and \(\varTheta _\mathrm{RM}\) (including \(\mathbf {W}_{EAL_1}\), \(\mathbf {W}_{EAL_2}\), \(\mathbf {W}_{REL}\), \(\mathbf {W}_{ECAL}\), \(\mathbf {b}_{ECAL}\), \(\mathbf {s}_{ECAL}, \cdots\)) stands for the parameter set for training the recommendation model.

3.9.2 Training strategy

Inspired by [51], we optimize \({\mathcal {L}}_\mathrm{KGE}\) and \({\mathcal {L}}_\mathrm{RM}\) alternately with the widely used Adam optimizer [24]. We chose Adam since it adapts the learning rate during training. The learning algorithm of our model is presented in Algorithm 1. In every training epoch, we perform KGE training (lines 3-8) and recommendation model training (lines 9-24) alternately.

Algorithm 1: The learning algorithm of McHa
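For illustration, the alternating scheme of Algorithm 1 can be sketched as follows. The batch iterators `kge_batches` and `rec_batches`, the interface `model.predict`, and the reuse of `kge_bpr_loss` from Section 3.9.1 are our assumptions; the L2 term of (26) could alternatively be realized via Adam's weight_decay.

```python
import torch
import torch.nn.functional as F

def train(model, kge_batches, rec_batches, epochs=10, lr=1e-3):
    """Sketch of the alternating training strategy of Algorithm 1: per epoch,
    one pass of KGE training, then one pass of recommendation training."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)  # adaptive learning rate
    for _ in range(epochs):
        for h, r, t, t_neg in kge_batches:             # KGE phase (Eq. 25)
            opt.zero_grad()
            kge_bpr_loss(model, h, r, t, t_neg).backward()
            opt.step()
        for u, v, y in rec_batches:                    # recommendation phase (Eq. 26)
            opt.zero_grad()
            y_hat = model.predict(u, v)                # Eq. (24): sigmoid(u^T v)
            F.binary_cross_entropy(y_hat, y).backward()
            opt.step()
```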

4 Experiments

4.1 Datasets

We choose the following three widely used benchmark datasets of recommendation tasks to evaluate our model.

  • MovieLens-1M is a movie rating dataset widely used in recommendation tasks. It includes ratings (ranging from 1 to 5) for movies and demographic data (age, gender, occupation, etc.) about users.

  • Last.FM is a dataset collected from an online music website for providing music recommendations. This dataset includes the listened-artist records of users and the metadata about users and artists.

  • Book-Crossing is a book rating dataset from the Book-Crossing community. It includes ratings (ranging from 0 to 10) for books and metadata about users and books.

The statistics of the three benchmark datasets are shown in Table 1. As suggested in [46, 55], we convert the ratings in MovieLens-1M and Book-Crossing into binary feedback: an entry is marked as 1 if the item has been positively rated by the user. Practically, the rating threshold of MovieLens-1M is set to 4, which means that an entry is marked as 1 if its rating score is not smaller than 4. No threshold is set for Book-Crossing due to the sparsity of its interactions, so every observed entry is marked as 1. For Last.FM, a user-artist entry is marked as 1 if it appears in the listened-artist records. For the three benchmark datasets, the entries marked as 1 are regarded as the observed interactions. Accordingly, for each user we randomly sample unobserved interactions marked as 0, equal in size to the observed interactions. We split the mixed interactions, including the observed and unobserved interactions, into training, validation, and test sets with a ratio of 6:2:2. We train our model on the training data, tune hyper-parameters on the validation data, and evaluate the performance of our model on the test data. Following [46, 55, 56], we use Microsoft Satori to construct the UIEKG for each dataset. Specifically, we link the items to the entities by matching their names with a confidence level > 0.9. For MovieLens-1M and Last.FM, we group the interacted items (movies and artists) of each user into item clusters according to their genres, while for Book-Crossing, we group the items (books) into item clusters according to their authors.
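The binarization step can be sketched as follows, with `ratings` assumed to map (user, item) pairs to raw scores:

```python
def binarize(ratings, threshold=4):
    """Sketch of the implicit-feedback conversion: for MovieLens-1M an entry
    is positive iff its rating is >= 4; for Book-Crossing pass threshold=0
    so that every observed rating counts as positive."""
    return {(u, v): 1 for (u, v), score in ratings.items()
            if score >= threshold}
```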

Table 1 Statistics of the three benchmark datasets. # stands for the number

4.2 Baselines

We choose the following representative or state-of-the-art models as baselines:

  • SVD++ [25] is an improved version of Singular Value Decomposition (SVD) that additionally incorporates the user’s implicit feedback to items.

  • CKE [64] is a unified framework that combines collaborative filtering with knowledge base embedding to learn different representations jointly.

  • MKR [50] builds several cross and compress units, which automatically share latent features and learn high-order interactions between items in recommender systems and entities in the knowledge graph.

  • KGCN [56] captures inter-item relatedness effectively by mining their associated attributes in KG. Besides, it samples from the neighbors for each entity in KG and then combines the neighborhood information when calculating the representation of a given entity.

  • KGAT [51] is a model that propagates the embeddings from the node’s high-order neighbors to the central node and employs an attention mechanism to discriminate the importance of the neighbors.

  • MVIN [46] improves item representations from both the user view, which gathers personalized knowledge information, and the entity view, which considers the differences among layers.

  • RippleNet [55] propagates user preferences over the set of entities by extending a user’s potential interests along links extracted from KG.

  • FairGo [57] is a model-agnostic framework, which considers fairness from a user-item bipartite graph perspective. In detail, it eliminates the unfairness through a graph-based adversarial training process.

It should be noted that the hyper-parameters of baselines are set to the default or recommended parameters in the published literature.

Table 2 Hyper-parameter settings of McHa

4.3 Experiment setup

4.3.1 Hyper-parameters

The hyper-parameter settings are listed in Table 2. In detail, d and \(d_r\) stand for the embedding dimensions of entities and relations, respectively, and \(d'\) is the dimension of the context vector. We let |EC| and \(|S_{v}|\) denote the number of entities in each entity cluster and the number of entity clusters in each \(S_{v}\), respectively. Similarly, |IC| and \(|S_{u}|\) are the number of items in each item cluster and the number of item clusters in each \(S_{u}\), respectively. It should be noted that the sizes of EC, \(S_{v}\), IC, and \(S_{u}\) are not fixed for each user; as suggested in [55], we apply a sampling strategy to fix these sizes for every user. \(\lambda\) is the regularization coefficient. The batch size and learning rate are set to 128 and 0.001 for both KGE and recommendation training. The hyper-parameters reported in this paper were selected by grid search.

4.3.2 Evaluation metrics

For the CTR prediction task, we use AUC, ACC, and F1-score to evaluate the performance of our model. For the top-N recommendation task, we adopt Precision@N, Recall@N, and F1-score@N to evaluate the ability of our model to select the N items with the highest click probability for the user. Each experiment is repeated 5 times, and the average results (mean) with standard deviation (std) on the test dataset are reported.
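For reference, the CTR metrics can be computed with scikit-learn as sketched below; the 0.5 decision threshold for ACC and F1 is our assumption.

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def ctr_metrics(y_true, y_prob, cut=0.5):
    """Sketch of the CTR evaluation: AUC on raw probabilities, ACC and F1
    on thresholded predictions."""
    y_pred = [1 if p >= cut else 0 for p in y_prob]
    return (roc_auc_score(y_true, y_prob),
            accuracy_score(y_true, y_pred),
            f1_score(y_true, y_pred))
```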

Table 3 Results (Mean±std over 5 test runs) for the CTR prediction task

4.4 Results and discussion

We evaluate our model in two recommendation tasks: (1) CTR prediction, and (2) top-N recommendation. We have the following observations.

4.4.1 CTR prediction task

As shown in Table 3, our model achieves the best performance in the CTR prediction task compared with the baselines. Specifically, the F1-score is improved by an average of 2.3%, 5.0%, and 10.7% on MovieLens-1M, Last.FM, and Book-Crossing, respectively. Compared with our model, KGCN, KGAT, MVIN, and RippleNet perform worse, probably because noisy information from irrelevant high-order nodes is unintentionally introduced and amplified step by step during information propagation in these methods. CKE does not perform well compared with the other KG-aware methods when the visual embeddings are missing. Besides, due to the lack of external information, SVD++ performs worse, especially in the face of sparser data (e.g., Last.FM and Book-Crossing). Although FairGo attempts to improve recommendation performance by mitigating the unfairness issue, it does not perform well compared with the KG-aware methods due to the lack of the external information provided by the KG.

4.4.2 Top-N recommendation task

As shown in Figure 3, our model also achieves the best performance compared with the baselines. Given that Last.FM is a smaller dataset than MovieLens-1M and Book-Crossing, the pronounced improvement of our model on this dataset indicates that our model adapts well to smaller datasets in the top-N recommendation task.

Fig. 3: Results (Mean over 5 test runs) for the top-N recommendation task

4.5 Ablation study

4.5.1 Ablation setup

In this part, we conduct ablation experiments to verify the positive effect of every attention layer in McHa. Experimentally, we perform the ablation by replacing each attention layer of McHa with a single-layer feedforward neural network with the tanh activation function. For the ablation of the relation enhancing layer, we only eliminate \(\mathbf {e}_r\) in (10). We use abbreviations to denote McHa’s variants; for example, “\(\text {McHa}_{\text { w/o EAL}}\)” denotes McHa without the Entity-level Attention Layer (EAL).

Table 4 Ablation study results (Mean±std over 5 test runs). “w/o” means without. The best results are reported in boldface

4.5.2 Ablation results

As shown in Table 4, McHa outperforms all its variants. This observation demonstrates that every attention layer of our proposed framework makes an essential, positive contribution to the performance of our model. Specifically, \(\text {McHa}_{\text {w/o ECAL}}\) (McHa without the Entity Cluster-level Attention Layer) and \(\text {McHa}_{\text {w/o ICAL}}\) (McHa without the Item Cluster-level Attention Layer) perform worse than the other variants, which indicates that multistage clustering plays a significant positive role in capturing the user’s preference.

Fig. 4: Parameter sensitivity with respect to the embedding dimension d, the number of entity clusters \(|S_{v}|\), and the number of item clusters \(|S_{u}|\). The other hyper-parameters are fixed according to Table 2

4.6 Parameter sensitivity analysis

4.6.1 Embedding dimension

We vary \(d \in \{4, 8, 16, 32, 64, 128\}\) to study the influence of the dimension in the knowledge graph embedding layer. As shown in Figure 4(a), increasing the dimension initially boosts the performance, since a higher-dimensional vector preserves more information; however, if the embedding dimension is too large, the model suffers from overfitting.

4.6.2 Number of entity clusters

We vary \(|S_{v}| \in \{2,3,4,5,6,7\}\) to examine the influence of the number of entity clusters. As shown in Figure 4(b), performance deteriorates when \(|S_{v}|\) is set smaller or larger than the ideal value. This can be explained as follows: a smaller or larger size of \(S_{v}\), produced by the sampling strategy, leads to the loss of information or the introduction of noisy entities, respectively. This observation implies that the performance of our model is sensitive to the number of entity clusters in \(S_{v}\), and that grouping the entities into different entity clusters is effective for capturing the user’s fine-grained preferences.

4.6.3 Number of item clusters

We vary \(|S_{u}| \in \{2,3,4,5,6,7\}\) to examine the influence of the number of item clusters. As shown in Figure 4(c), our model achieves the best results when \(|S_{u}|\) is set to 4, 5, and 4 for MovieLens-1M, Last.FM, and Book-Crossing, respectively. This observation indicates that the performance of our model is sensitive to the number of item clusters in \(S_u\); in other words, properly grouping the items into different item clusters contributes positively to the recommendation performance.

4.7 Interpretability with case study

Prior works [7, 13] have shown that the attention mechanism can both benefit and explain recommendation results. On this basis, we provide a visual case to intuitively explain the recommendation results of our model. We randomly sample a user (User ID: 9) from MovieLens-1M. As shown in Figure 5, the movies in this user’s viewing record extracted from MovieLens-1M are grouped by our model into four item clusters according to their genres. In this case, \(Item\ Cluster_{Comedy}\) and \(Item\ Cluster_{Animation}\) are assigned the largest and smallest attention weights, respectively, when calculating the user preference representation. This means that the fine-grained and focused information that the user is more interested in comedy than in animation is encoded into the user preference representation. To verify the effectiveness of this user preference, we feed two new candidate movies, The Tigger Movie (an animation) and A League of Their Own (a comedy), into our model to calculate their CTR probabilities. The comedy A League of Their Own receives a higher CTR probability (0.857) than the animation The Tigger Movie (0.092), which demonstrates that the user preference captured by our model is effective. In summary, this case implies that our model can generate an expressive user preference representation and that the recommendation results can be explained by the attention weights.

Fig. 5: A real case (User ID: 9) from MovieLens-1M

5 Conclusion and future work

In this paper, we propose a novel KG-aware recommendation model, namely McHa. It overcomes the limitation that the fine-grained and focused multi-level heterogeneity information remains barely exploited by existing methods. Specifically, we first capture the multi-level heterogeneity information by grouping the items and their connected entities into item clusters and entity clusters (jointly referred to as multistage clusters), respectively. Then, the user preference is obtained by hierarchically aggregating the multi-level heterogeneity information with the weights generated by the hierarchical attention layers. Extensive experiments show the effectiveness of our model.

However, further research is still needed. For example, we only consider the nearest (1-hop) entities around each item in the knowledge graph. How to extend our model to efficiently process multi-hop entities needs further study, so as to explore more of the potential information in the knowledge graph.