1 Introduction

Recommender systems are among the most widely deployed applications in on-line services such as E-commerce platforms, news portals, advertising, and social media sites, where they help users cope with information overload [1]. Generally speaking, the commonly used recommendation techniques are roughly classified into content-based and collaborative filtering-based methods. Since Collaborative Filtering (CF) imposes no restriction on content, it is more widely used in both the literature and industry [7, 8, 13, 36, 42]. In recommender systems, CF aims to predict the missing ratings for an item or a user based on the ratings collected from similar items or like-minded users [9, 14, 32, 40]. CF has become one of the most popular methods in recommender systems due to its simplicity and high efficiency [39, 41], but it also faces major bottlenecks such as data sparsity [18]. Data sparsity refers to the fact that too few entries of the user-item rating matrix are observed, which makes the recommendation model prone to overfitting and leads to low-quality predictions [30].

To cope with the sparsity issue, numerous improved CF methods have been proposed. Among them, Matrix Factorization (MF) has achieved great success over the past decade and more. For example, Srebro et al. [37] presented the Maximum Margin Matrix Factorization (MMMF) method, which learns a fully observed matrix with minimum trace norm by maximizing the prediction margin to approximate the target preference matrix. Zhang et al. [45] presented a Weighted Nonnegative Matrix Factorization (WNMF) method that decomposes the observed user-item rating matrix into two low-dimensional nonnegative matrices and then uses their product to predict the unobserved user-item ratings. Gu et al. [11] extended the WNMF method by incorporating user and item graphs. Chen et al. [6] applied Orthogonal Nonnegative Matrix Tri-Factorization (ONMTF) to collaborative filtering to alleviate the sparsity problem. Mnih et al. [24] proposed the Probabilistic Matrix Factorization (PMF) method, which uses a probabilistic model with Gaussian observation noise and maximizes the conditional distribution over the observed ratings. The recommendation accuracy of these MF methods still largely depends on the target rating matrix; when the observed target rating matrix is very sparse, their performance degrades seriously.

Recently, transfer learning (TL) [25] methods have been introduced into collaborative filtering to address the data sparsity issue [15, 22, 35, 46]. The core idea is to extract common latent knowledge from dense auxiliary data via a latent factorization model and then transfer it to the sparse target data [3, 33]. Existing TL-based CF methods mostly select cluster-level codebooks [18, 19] or the latent tastes of users and latent features of items [23, 33, 35] as the common knowledge to transfer, and they are often limited to transferring homogeneous user feedback. However, heterogeneous user feedback is more common in reality. To this end, several TL-based CF methods using heterogeneous user feedback have been proposed to solve the data sparsity issue. For example, Pan et al. [27, 30, 31] explore how to utilize dense binary preference data (“like” or “dislike”) in the auxiliary domain to aid recommendation in a target domain with numerical rating data (5-star ratings). However, in practical scenarios the target rating matrix is often extremely sparse (e.g., sparsity level ≤ 0.1%), in which case these methods are likely to encounter the following two limitations. First, the under-transfer issue is likely to occur, because extremely sparse target data requires more knowledge to be transferred from the auxiliary data; under-transfer means that the useful knowledge in the auxiliary data is not fully transferred to the target data. Second, the latent factors extracted from the auxiliary data may carry harmful information into the target data, resulting in the negative transfer issue.

In this paper, we investigate how to transfer knowledge effectively from dense auxiliary binary rating data to extremely sparse target numerical rating data so as to improve the prediction accuracy of the recommender system. To address the problems mentioned above, we propose a new TL-based CF method, referred to as Enhanced Knowledge Transfer for Collaborative Filtering with Multi-Source Heterogeneous Feedbacks (EKT). First, we use the proposed weighted collective matrix tri-factorization framework to extract the common latent factors of users and items as well as the common partial cluster-level rating pattern, through which more knowledge can be transferred between domains. Second, we incorporate the graph co-regularization of user and item graphs into the proposed weighted collective matrix tri-factorization, which preserves the intrinsic geometric structure in each domain and thus alleviates the negative transfer issue. Simultaneously, when the target data is extremely sparse, the neighborhood structure information of the auxiliary data can be transferred to the target data, further enhancing the knowledge transfer. The main contributions of this paper are summarized as follows:

  • We propose a new TL-based CF framework, which integrates weighted collective matrix tri-factorization and graph co-regularization of user and item graphs in a unified framework. The proposed framework can alleviate the under-transfer and negative transfer issues that may be caused by extremely sparse target data.

  • For the proposed framework, we derive an alternating optimization procedure and further prove its convergence.

  • On two benchmark data sets, we demonstrate the effectiveness of the proposed EKT method at a variety of sparsity levels from \(0.01\%\) to \(1\%\); the proposed EKT method outperforms several state-of-the-art baseline methods when the sparsity level is less than 1%.

The rest of the paper is organized as follows. We first briefly review related work in Section 2 and then describe the proposed EKT method in detail in Section 3. After that, we present experiments conducted on two real-world data sets to verify the effectiveness of the proposed EKT method in Section 4. Finally, we give the conclusion and directions for future work in Section 5. The notations used throughout the paper are listed in Table 1.

Table 1 Notations

2 Related work

This paper focuses on improving the recommendation accuracy of collaborative filtering methods when the rating data is extremely sparse. In this section, we discuss the related work.

Collaborative filtering is a simple and effective recommendation method, but when the target data is very sparse it easily produces degraded prediction results [38]. In recent years, transfer learning has been introduced to address the data sparsity problem in collaborative filtering [12, 20, 33] by transferring knowledge from auxiliary data to target data. In terms of what is transferred, most existing transfer learning methods for collaborative filtering follow one of two perspectives. The first is cluster-level rating pattern transfer, which transfers the cluster-level rating pattern from the auxiliary data to the target data. For example, Li et al. proposed the Codebook Based Transfer (CBT) model [18] to transfer cluster-level rating behavior from auxiliary movie rating data to target book rating data. A further extension of CBT is the Rating-Matrix Generative Model (RMGM) [19], which relaxes the hard membership constraint on user/item groups to a soft membership. Gao et al. [10] extended Li et al.’s methods and proposed the cluster-level Latent Factor Model (CLFM), which achieves knowledge transfer by sharing partial cluster-level rating patterns across multiple domains. Zhang et al. [44] argue that the CBT method cannot ensure that the knowledge extracted from the auxiliary domain is consistent with the target domain. To this end, they presented a cross-domain recommender system with Consistent Information Transfer (CIT) [44], which adopts a domain adaptation strategy to make the distributions of latent factors in the two domains as close as possible and then performs a codebook transfer. Cluster-level rating pattern transfer mostly targets scenarios where there is no entity (user or item) correspondence between the target data and the auxiliary data. Because of the limited size of the cluster-level rating pattern, it cannot transfer enough useful knowledge when the observed target data is quite sparse.

The other is latent factor transfer, which transfers user/item latent factors from the auxiliary data to the target data. CMF [35] is a multi-task learning method that jointly factorizes the target rating matrix and an item-side content matrix while sharing the same item latent factors. Similarly, Ma et al. [23] proposed the SoRec method, which alternately factorizes the target rating matrix and a user-side social network matrix while sharing the same user latent factors. Pan et al. proposed the Coordinate System Transfer (CST) model [30] and Transfer by Collective Factorization (TCF) [27, 31] to transfer the latent factors of users and items from dense auxiliary binary rating data to sparse target numerical rating data, but the two models do not take into account the neighborhood information between users or between items. Shi et al. [33] proposed Twin Bridge Transfer learning (TBT), which transfers knowledge from dense auxiliary data to sparse target data by using latent factors and similarity graphs as twin bridges. Pan et al. [28] extended the CMF method and proposed interaction-rich Transfer by Collective Factorization (iTCF), which not only constrains the target data and the auxiliary data to share the same item latent factors, but also requires information interaction between the user latent factors of the two domains. Pan et al. [29] further extended iTCF and proposed the Transfer by Mixed Factorization (TMF) approach, which introduces a virtual user profile to model the user’s implicit preference; the virtual user profile is composed of the latent factors of the items that the user likes and dislikes in the target domain. Zhao et al. [46] relaxed the assumption that there is an adequate set of entity correspondences across domains and employed an active learning principle to construct entity correspondences across domains; the actively constructed entity correspondences are then plugged into a general transfer-based CF model to improve recommendation performance. Zhang et al. [43] proposed a cross-domain recommender system based on kernel-induced knowledge transfer (KerKT) for the scenario where entities only partially overlap between domains. KerKT exploits domain adaptation to adjust the feature spaces of overlapping entities and adopts kernel diffusion completion to correlate the non-overlapping entities between domains.

Pan et al. [26] gave a comprehensive survey of transfer learning for collaborative recommendation with auxiliary data, which considers the CRAD (Collaborative Recommendation with Auxiliary Data) problem from a transfer learning view and then discusses three knowledge transfer strategies for collaborative recommendation with different types of auxiliary data. Most existing TL-based CF techniques can be summarized under these three strategies, namely adaptive knowledge transfer, collective knowledge transfer, and integrated knowledge transfer. According to the survey of Pan et al. [26], our proposed EKT method belongs to collective knowledge transfer, which aims to jointly learn the shared knowledge and the unshared effects of the target data and the auxiliary data. The advantage of collective knowledge transfer is that it performs bi-directional knowledge transfer and thus has richer interactions, similar to multi-task learning. Most of the previously mentioned methods focus on either latent factor transfer or cluster-level rating pattern transfer. Different from previous work, in this paper we jointly transfer the latent factors and the partial cluster-level rating patterns, so that more knowledge can be transferred. In addition, to alleviate the possible negative transfer issue, we use the graph co-regularization of user and item graphs to refine the latent factors of users and items and preserve the intrinsic geometric structure.

3 Enhanced transfer learning for collaborative filtering with multi-source heterogeneous feedbacks

In this section, we first define the problem setting and then propose the Enhanced Knowledge Transfer for Collaborative Filtering with Multi-Source Heterogeneous Feedbacks (EKT) framework. Finally, we present the optimization procedure of the proposed EKT method and analyze its convergence.

3.1 Problem formulation

In the target data, suppose there is a user-item numerical rating matrix \(X_{t}=[(x_{t})_{ij}]_{M\times N}\in {\{1,2,3,4,5,?\}^{M\times N}}\) (“?” denotes a missing value). \(Y_{t}=[(y_{t})_{ij}]_{M\times N}\in \{0,1\}^{M\times N}\) is the indicator matrix, with (yt)ij = 1 if user i has rated item j, and (yt)ij = 0 otherwise. Similarly, in the auxiliary data there is a user-item binary rating matrix \(X_{a}=[(x_{a})_{ij}]_{M\times N}\in {\{ 0,1,?\}}^{M \times N}\), where “0” and “1” represent the observed “dislike” and “like” values, respectively, and the question mark “?” represents a missing value. \(Y_{a}=[(y_{a})_{ij}]_{M\times N} \in \left \{ {0,1}\right \}^{M \times N}\) is the corresponding indicator matrix. It is assumed here that the users and items of Xt and Xa are aligned, that is, there is a one-to-one mapping. We aim to predict the missing values in the target rating matrix Xt by transferring the rating knowledge in the auxiliary rating matrix Xa. According to Li’s description of a “domain” [17], Xt and Xa can be seen as coming from different data domains. Denote by \(\mathcal {D}_{\tau }\) the τ domain, where τ ∈{t,a} is the domain index. Therefore, the task can be regarded as a cross-domain recommendation problem.
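To make the notation concrete, the following NumPy sketch shows one possible way to represent the rating and indicator matrices defined above (the toy values and variable names are ours, not part of the paper):

```python
import numpy as np

M, N = 4, 5  # toy numbers of users and items

# Target numerical ratings in {1,...,5}; np.nan plays the role of "?".
X_t = np.array([[5, np.nan, 3, np.nan, 1],
                [np.nan, 4, np.nan, 2, np.nan],
                [1, np.nan, np.nan, 5, np.nan],
                [np.nan, np.nan, 4, np.nan, 3]], dtype=float)

# Auxiliary binary ratings: 1 = like, 0 = dislike, np.nan = missing.
X_a = np.array([[1, 0, 1, np.nan, 0],
                [0, 1, np.nan, 0, 1],
                [0, np.nan, 1, 1, np.nan],
                [np.nan, 1, 1, 0, 1]], dtype=float)

# Indicator matrices Y: 1 where a rating is observed, 0 otherwise.
Y_t = (~np.isnan(X_t)).astype(float)
Y_a = (~np.isnan(X_a)).astype(float)

# Replace missing entries by 0 so that Y ⊙ X ignores them in later losses.
X_t = np.nan_to_num(X_t)
X_a = np.nan_to_num(X_a)
```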

3.2 EKT method

The proposed EKT framework incorporates two learning objectives into a unified optimization problem: weighted collective matrix tri-factorization and the graph co-regularization of user and item graphs. The graphical model is shown in Fig. 1.

Fig. 1

Graphical model of Enhanced Transfer Learning for Collaborative Filtering with Multi-Source Heterogeneous Feedbacks. The target numerical rating matrix Xt and the auxiliary binary rating matrix Xa are jointly decomposed. Note that the blue solid double arrow indicates that the two domains share the same user/item’s latent factor. The blue dashed double arrow indicates that the two domains share the partial cluster-level user-item rating pattern. The black dashed arrow indicates that the graph regularization constraint is imposed on the latent factor of the user/item

3.2.1 Weighted collective matrix Tri-factorization

First, we propose a Weighted Collective Matrix Tri-Factorization (WCMTF) framework by extending Collective Matrix Factorization (CMF) and Weighted Nonnegative Matrix Tri-Factorization (WNMTF). We then use WCMTF to extract the latent factors and the cluster-level rating pattern, which narrows the data distribution gap between domains and transfers more latent knowledge from the auxiliary domain to the target domain.

Given the target numerical rating matrix Xt and the auxiliary binary rating matrix Xa, we can extract the latent factors of each rating matrix via Weighted Nonnegative Matrix Tri-Factorization (WNMTF) [11]. In WNMTF, the nonnegative user-item rating matrix Xτ is decomposed into three low-rank matrices, Uτ, Bτ and Vτ, such that the reconstruction error of Xτ is minimized. WNMTF is equivalent to the following optimization problem:

$$ \underset{{U_{\tau} },{B_{\tau} },{V_{\tau} } \ge 0}{{\min}} \left\| {{Y_{\tau} } \odot \left( {{X_{\tau} } - {U_{\tau} }{B_{\tau} }V_{\tau}^{T}} \right)} \right\|_{F}^{2}, $$
(1)

where ⊙ represents the element-wise product of matrices, and Yτ is the indicator matrix denoting whether the rating in Xτ is observed or not. \({U_{\tau }}=[u_{\cdot 1}^{\tau },u_{\cdot 2}^{\tau },\cdot \cdot \cdot ,u_{\cdot d_{1}}^{\tau }]\in \mathbb {R}^{M\times d_{1}}\) represents the latent factor matrix of users, with each \(u_{i\cdot }^{\tau }\) denoting the latent taste of user i. \({V_{\tau }}=[v_{\cdot 1}^{\tau },v_{\cdot 2}^{\tau },\cdot \cdot \cdot ,v_{\cdot d_{2}}^{\tau }]\in \mathbb {R}^{N\times d_{2}}\) represents the latent factor matrix of items, with each \(v_{i\cdot }^{\tau }\) denoting the latent feature of item i. \({B_{\tau }} \in \mathbb {R}^{d_{1} \times d_{2}}\) is the cluster-level user-item rating pattern representing the association between Uτ and Vτ, τ ∈{t,a}.
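A minimal sketch of the weighted reconstruction error in (1), assuming the matrices are stored as NumPy arrays with the shapes given above (function and variable names are illustrative):

```python
import numpy as np

def wnmtf_loss(X, Y, U, B, V):
    """Weighted reconstruction error || Y ⊙ (X - U B V^T) ||_F^2, cf. (1)."""
    R = Y * (X - U @ B @ V.T)     # the element-wise mask ignores unobserved entries
    return np.sum(R ** 2)

# Toy shapes: M users, N items, d1 user clusters, d2 item clusters.
M, N, d1, d2 = 6, 8, 3, 4
rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(M, N)).astype(float)
Y = (rng.random((M, N)) < 0.3).astype(float)      # roughly 30% observed
U, B, V = rng.random((M, d1)), rng.random((d1, d2)), rng.random((N, d2))
print(wnmtf_loss(X, Y, U, B, V))
```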

Since the users and items of the target numerical rating matrix and the auxiliary binary rating matrix are aligned under our assumption, they share latent user tastes and latent item features. We may therefore improve prediction accuracy by sharing the common latent factors of users and items underlying these two rating matrices. Similar to CMF [35], we extend the basic WNMTF to decompose the two related matrices simultaneously, resulting in the weighted collective matrix tri-factorization

$$ \begin{array}{ll} \underset{{U_t},{U_a},{B_t},{B_a},{V_t},{V_a} \ge 0}{\min} \left\| {{Y_t} \odot \left( {{X_t} - {U_t}{B_t}V_t^T} \right)} \right\|_F^2{ + \lambda}\left\| {{Y_a} \odot \left( {{X_a} - {U_a}{B_a}V_a^T} \right)} \right\|_F^2\\ s.t. {U_t} \equiv {U_a} \equiv U,{V_t} \equiv {V_a} \equiv V, \end{array} $$
(2)

where λ > 0 is a trade-off parameter used to balance the target data with the auxiliary data. The above optimization problem can be further simplified into the following form

$$ \underset{U,{B_{t}},{B_{a}},V \ge 0}{\min} \left\| {{Y_{t}} \odot \left( {{X_{t}} - U{B_{t}}{V^{T}}} \right)} \right\|_{F}^{2}{\text{ + }}\lambda \left\| {{Y_{a}} \odot \left( {{X_{a}} - U{B_{a}}{V^{T}}} \right)} \right\|_{F}^{2}. $$
(3)

Without the regularization terms, formula (3) is similar to the TCF framework [27, 31]. Note, however, that TCF places no non-negativity constraints on the variables U, Ba, Bt, and V. Both the TCF method and formula (3) transfer knowledge only by sharing the same latent factors of users and items. However, when the target data is extremely sparse, it is desirable to transfer more common knowledge from the auxiliary data to alleviate the sparsity of the target data; we therefore believe that their knowledge transfer is likely to be insufficient. Gao et al. [10] proposed the CLFM model, which can not only learn the common rating pattern shared across domains, but also learn the domain-specific rating pattern of users in each domain. The rating pattern matrix can be interpreted as the probability that a user group gives a particular rating to the corresponding item cluster. Inspired by this, we consider the partial cluster-level user-item rating pattern as a new bridge for knowledge transfer. Although the rating forms of the target data and the auxiliary data are heterogeneous, i.e., {1,2,3,4,5,?} and {0,1,?}, the rating forms of the pre-processed target data and the auxiliary data are partially aligned, i.e., {0,0.25,0.5,0.75,1,?} and {0,1,?}. Therefore, we can further assume that the preprocessed target data and the auxiliary data share a partial cluster-level user-item rating pattern.

To this end, we assume that the cluster-level rating patterns Bt and Ba in formula (3) can be expressed as [S,St] and [S,Sa], respectively. Formally, the collective matrix tri-factorization framework in (3) can be re-written as follows

$$ \underset{U,S,{S_{t}},{S_{a}},V \ge 0}{\min} \left\| {{Y_{t}} \odot \left( {{X_{t}} - U[S,{S_{t}}]{V^{T}}} \right)} \right\|_{F}^{2}{\text{ + }}\lambda \left\| {{Y_{a}} \odot \left( {{X_{a}} - U[S,{S_{a}}]{V^{T}}} \right)} \right\|_{F}^{2}, $$
(4)

where U and V denote the shared user latent factor matrix and item latent factor matrix, respectively. S denotes the shared user-item rating pattern matrix. St and Sa denote the specific rating pattern matrices of the target data and the auxiliary data, respectively.

In formula (4), we require the latent factor matrices of users and items in both domains to be exactly the same. However, although the target domain and the auxiliary domain are related, the latent user tastes and item features in the two domains can still differ somewhat due to domain-specific context. Therefore, we relax the constraint that the latent factors of users and items in both domains are exactly the same and only require them to be similar, which can be achieved by adding the regularization terms \(\left \| {{U_{t}} - {U_{a}}} \right \|_{F}^{2}\) and \(\left \| {{V_{t}} - {V_{a}}} \right \|_{F}^{2}\) to the framework in (4). The objective function in framework (4) can then be further expressed as

$$ \begin{array}{@{}rcl@{}} \underset{{U_{t}},{U_{a}},{V_{t}},{V_{a}},S,{S_{t}},{S_{a}} \ge 0}{min} &&\left\| {{Y_{t}} \odot ({X_{t}} - {U_{t}}[S,{S_{t}}]{V_{t}}^{T})} \right\|_{F}^{2}\\ &+& \lambda \left\| {{Y_{a}} \odot ({X_{a}} - {U_{a}}[S,{S_{a}}]{V_{a}}^{T})} \right\|_{F}^{2}\\ &+& {\theta_{u}}\left\| {{U_{t}} - {U_{a}}} \right\|_{F}^{2} + {\theta_{v}}\left\| {{V_{t}} - {V_{a}}} \right\|_{F}^{2}, \end{array} $$
(5)

where Ut and Vt denote the latent factor matrices of users and items in the target data, respectively, and Ua and Va denote the latent factor matrices of users and items in the auxiliary data, respectively. 𝜃u and 𝜃v are tradeoff parameters representing the confidence in the auxiliary data. The proposed WCMTF framework is illustrated in Fig. 2.

Fig. 2

Illustration of our proposed WCMTF framework
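The role of the shared pattern S and the specific patterns St and Sa in (4)-(5) can be illustrated by a short sketch of how the two rating matrices would be reconstructed (all names and toy dimensions are ours):

```python
import numpy as np

M, N, d1, d2, c = 6, 8, 3, 4, 2       # c columns of the rating pattern are shared
rng = np.random.default_rng(1)

U_t, U_a = rng.random((M, d1)), rng.random((M, d1))
V_t, V_a = rng.random((N, d2)), rng.random((N, d2))
S = rng.random((d1, c))               # shared cluster-level rating pattern
S_t = rng.random((d1, d2 - c))        # target-specific pattern
S_a = rng.random((d1, d2 - c))        # auxiliary-specific pattern

X_t_hat = U_t @ np.hstack([S, S_t]) @ V_t.T   # reconstruction of the target ratings
X_a_hat = U_a @ np.hstack([S, S_a]) @ V_a.T   # reconstruction of the auxiliary ratings
```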

In formula (5), useful knowledge from the auxiliary data can be adequately transferred to the target data by sharing the user latent factors, the item latent factors, and the partial cluster-level user-item rating pattern. However, more knowledge transfer may also lead to a more serious negative transfer issue, that is, harmful information in the auxiliary data being transferred to the target data. Negative transfer is especially likely to occur when the target data is extremely sparse. Therefore, we need to find a way to alleviate the possible negative transfer issue.

3.2.2 Graph co-regularization of user and item graphs

We adopt the co-regularization of user and item graphs to refine the latent factors of the two domains. In this way, we gain two benefits: (i) the respective intrinsic geometric property of each domain can be preserved, which allows the learning model to carefully follow the domain-specific data distribution and fundamentally alleviates the negative transfer problem; (ii) when the target data is extremely sparse, the neighborhood structure information between entities (users or items) is inaccurate, whereas the relatively dense auxiliary binary rating data has relatively more accurate neighborhood structure information. Transferring the neighborhood structure information of entities in the auxiliary data to the target data helps the target data obtain more accurate neighborhood structure information and further alleviates the under-transfer issue.

User Graph Regularization

From a geometric perspective, data points are usually sampled from a low-dimensional manifold embedded in a high-dimensional ambient space [4, 5]. By the manifold assumption [2], if two users \(x_{i \cdot }^{\tau }\) and \(x_{j \cdot }^{\tau }\) are close in the intrinsic geometry of the data distribution, then their embeddings \(u_{i \cdot }^{\tau }\) and \(u_{j \cdot }^{\tau }\) are also close to each other. This geometric structure of scattered data can be effectively encoded by a k-nearest neighbor graph. Consider a user graph \(\mathcal {G}_{U}^{\tau }=(\mathcal {V}_{U}^{\tau },\mathcal {E}_{U}^{\tau })\), whose vertex set \(\mathcal {V}_{U}^{\tau }\) corresponds to the users \(\{x_{1\cdot }^{\tau },\cdots ,x_{M\cdot }^{\tau }\}\). The symmetric adjacency matrix of \(\mathcal {G}_{U}^{\tau }\) can be defined as

$$ {({W_{U}^{\tau}})_{ij}} = \left\{ \begin{array}{ll} sim(x_{i \cdot }^{\tau},x_{j \cdot }^{\tau}), & \text{if } x_{j \cdot }^{\tau} \in \mathcal{N}(x_{i \cdot }^{\tau}) \text{ or } x_{i \cdot }^{\tau} \in \mathcal{N}(x_{j \cdot }^{\tau}),\\ 0, & \text{otherwise,} \end{array} \right. $$
(6)

where \(\mathcal {N}(x_{i\cdot }^{\tau })\) denotes the k-nearest neighbors of \(x_{i\cdot }^{\tau }\), and sim(⋅,⋅) is an appropriate similarity function for calculating similarities between users; it can be the cosine similarity, the Pearson correlation coefficient, the radial basis function (RBF) measure, etc. In our experiments, for convenience of calculation, we use the cosine similarity \(sim(x_{i\cdot}^{\tau },x_{j\cdot}^{\tau })=\frac {{x_{i\cdot}^{\tau }}\cdot {x_{j\cdot}^{\tau }}}{\|x_{i\cdot}^{\tau }\|\|x_{j\cdot}^{\tau }\|}\). We use the Euclidean distance to measure the closeness between each pair of embeddings \(u_{i\cdot }^{\tau }\) and \(u_{j\cdot }^{\tau }\), i.e., \(\|u_{i\cdot }^{\tau }-u_{j\cdot }^{\tau }\|_{2}^{2}\). According to Cai et al. [4], using the graph \(\mathcal {G}_{U}^{\tau }\) to maintain the geometric structure in domain \(\mathcal {D}_{\tau }\) can be accomplished via the following user graph regularization

$$ \begin{array}{@{}rcl@{}} \mathcal{R}({U_{\tau} }) &=& \frac{1}{2}\sum\limits_{i,j} {\left\| {u_{i \cdot }^{\tau} - u_{j \cdot }^{\tau} } \right\|_{2}^{2}} {\left( {{W_{U}^{\tau}}} \right)_{ij}}= \sum\limits_{i,j} {u_{i \cdot }^{\tau} {{\left( {{W_{U}^{\tau}}} \right)}_{ij}}u_{i \cdot }^{\tau T}} - \sum\limits_{i,j} {u_{i \cdot }^{\tau} {{\left( {{W_{U}^{\tau}}} \right)}_{ij}}u_{j \cdot }^{\tau T}} \\ &=& \sum\limits_{i} {u_{i \cdot }^{\tau} {{\left( {{D_{U}^{\tau}}} \right)}_{ii}}u_{i \cdot }^{\tau T}} - \sum\limits_{i,j} {u_{i \cdot }^{\tau} {{\left( {{W_{U}^{\tau}}} \right)}_{ij}}u_{j \cdot }^{\tau T}}= tr\left( {U_{\tau}^{T}\left( {{D_{U}^{\tau}} - {W_{U}^{\tau}}} \right){U_{\tau} }} \right)\\ &=& tr(U_{\tau}^{T}{L_{U}^{\tau}}{U_{\tau} }), \end{array} $$
(7)

where \(D_{U}^{\tau }=diag(\sum \limits _{j}(W_{U}^{\tau })_{ij})\) is a diagonal matrix, and \(L_{U}^{\tau }=D_{U}^{\tau }-W_{U}^{\tau }\) is the graph Laplacian matrix of user graph.
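As an illustration of (6)-(7), the following sketch builds a cosine-similarity k-nearest-neighbor adjacency matrix, the corresponding graph Laplacian, and the regularizer tr(UᵀLU); it is a simplified reading of the construction, not the authors' code:

```python
import numpy as np

def knn_cosine_adjacency(X, k=5):
    """Symmetric k-nearest-neighbor adjacency W with cosine similarity, cf. (6)."""
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    sim = Xn @ Xn.T                              # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)               # a point is not its own neighbor
    W = np.zeros_like(sim)
    for i in range(X.shape[0]):
        nbrs = np.argsort(sim[i])[-k:]           # indices of the k most similar rows
        W[i, nbrs] = sim[i, nbrs]
    return np.maximum(W, W.T)                    # keep an edge if either end selects the other

def graph_regularizer(U, W):
    """tr(U^T L U) with L = D - W, cf. (7) and (9)."""
    L = np.diag(W.sum(axis=1)) - W
    return np.trace(U.T @ L @ U), L

# Toy usage: rows of X are users (for the item graph, use columns of the rating matrix).
rng = np.random.default_rng(0)
X = rng.random((20, 30))
reg, L = graph_regularizer(rng.random((20, 4)), knn_cosine_adjacency(X, k=5))
```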

Item Graph Regularization

Similar to user graph regularization, by the manifold assumption [2], if two items \(x_{\cdot i}^{\tau }\) and \(x_{\cdot j}^{\tau }\) are close in the intrinsic geometry of the data distribution, then their embedding \(v_{i\cdot }^{\tau }\) and \(v_{j\cdot }^{\tau }\) are also close to each other. Consider an item graph \(\mathcal {G}_{V}^{\tau }=(\mathcal {V}_{V}^{\tau },\mathcal {E}_{V}^{\tau })\), whose vertex set \(\mathcal {V}_{V}^{\tau }\) corresponds to items \(\{x_{\cdot 1}^{\tau },\cdots ,x_{\cdot N}^{\tau }\}\). The symmetric adjacency matrix of \(\mathcal {G}_{V}^{\tau }\) can be defined as

$$ {({W_{V}^{\tau}})_{ij}} = \left\{ \begin{array}{ll} sim(x_{\cdot i}^{\tau},x_{\cdot j}^{\tau}), & \text{if } x_{\cdot j}^{\tau} \in \mathcal{N}(x_{\cdot i}^{\tau}) \text{ or } x_{\cdot i}^{\tau} \in \mathcal{N}(x_{\cdot j}^{\tau}),\\ 0, & \text{otherwise.} \end{array} \right. $$
(8)

According to Cai et al. [4], using the graph \(\mathcal {G}_{V}^{\tau }\) to preserve the geometric structure in domain \(\mathcal {D}_{\tau }\) can be achieved by the following item graph regularization

$$ \begin{array}{@{}rcl@{}} \mathcal{R}({V_{\tau} }) &=& \frac{1}{2}\sum\limits_{i,j} {\left\| {v_{i \cdot }^{\tau} - v_{j \cdot }^{\tau} } \right\|_{2}^{2}} {\left( {{W_{V}^{\tau}}} \right)_{ij}}= \sum\limits_{i,j} {v_{i \cdot }^{\tau} {{\left( {{W_{V }^{\tau}}} \right)}_{ij}}v_{i \cdot }^{\tau T}} - \sum\limits_{i,j} {v_{i \cdot }^{\tau} {{\left( {{W_{V}^{\tau}}} \right)}_{ij}}v_{j \cdot }^{\tau T}} \\ &=& \sum\limits_{i} {v_{i \cdot }^{\tau} {{\left( {{D_{V}^{\tau}}} \right)}_{ii}}v_{i \cdot }^{\tau T}} - \sum\limits_{i,j} {v_{i \cdot }^{\tau} {{\left( {{W_{V}^{\tau}}} \right)}_{ij}}v_{j \cdot }^{\tau T}} = tr\left( {V_{\tau}^{T}\left( {{D_{V}^{\tau}} - {W_{V}^{\tau}}} \right){V_{\tau} }} \right)\\ &=& tr(V_{\tau}^{T}{L_{V}^{\tau}}{V_{\tau} }), \end{array} $$
(9)

where \(D_{V}^{\tau }=diag(\sum \limits _{j}(W_{V}^{\tau })_{ij})\) is a diagonal matrix, and \(L_{V}^{\tau }=D_{V}^{\tau }-W_{V}^{\tau }\) is the graph Laplacian matrix of item graph.

We call the graph regularization terms in (7) and (9) the graph co-regularization of user and item graphs (GCRUI). On the one hand, GCRUI maintains the intrinsic geometric structure of users and items simultaneously, which alleviates the negative transfer issue. On the other hand, when the target data is extremely sparse, the relatively dense and accurate neighborhood structure information between entities in the auxiliary data can be transferred to the target data, further alleviating the under-transfer issue.

3.3 Optimization framework

To further improve the predictive performance of cross-domain recommendation, we consider these two learning objectives together, for the following reasons: (i) with weighted collective matrix tri-factorization, the common latent factors of users and items and the common partial cluster-level rating pattern are extracted, through which more knowledge can be transferred between domains; however, while enhancing the knowledge transfer, negative information from the auxiliary data may also be introduced into the target data, resulting in a negative transfer issue; (ii) with GCRUI, the intrinsic geometric structure of each domain can be preserved to alleviate the negative transfer issue; in addition, when the target data is extremely sparse, the neighborhood structure information in the auxiliary data is transferred to the target data to help it obtain more accurate neighborhood structure information, thereby further alleviating the under-transfer issue.

Based on the above considerations, we seamlessly integrate the weighted collective matrix tri-factorization and GCRUI into a unified framework, and obtain the following objective function

$$ \begin{array}{@{}rcl@{}} \underset{{U_{t}},{U_{a}},{V_{t}},{V_{a}},S,{S_{t}},{S_{a}} \ge 0}{\min} &&\left\| {{Y_{t}} \odot ({X_{t}} - {U_{t}}[S,{S_{t}}]{V_{t}}^{T})} \right\|_{F}^{2} \\ &+& \lambda \left\| {{Y_{a}} \odot ({X_{a}} - {U_{a}}[S,{S_{a}}]{V_{a}}^{T})} \right\|_{F}^{2} \\ &+& {\theta_{u}}\left\| {{U_{t}} - {U_{a}}} \right\|_{F}^{2} + {\theta_{v}}\left\| {{V_{t}} - {V_{a}}} \right\|_{F}^{2} \\ &+&{\alpha_{u}}tr({U_{t}^{T}}{{L_{U}^{t}}}{U_{t}}) + {\alpha_{v}}tr({V_{t}^{T}}{{L_{V}^{t}}}{V_{t}}) \\ &+&{\beta_{u}}tr({U_{a}^{T}}{{L_{U}^{a}}}{U_{a}}) + {\beta_{v}}tr({V_{a}^{T}}{{L_{V}^{a}}}{V_{a}}), \end{array} $$
(10)

where αu and αv are the graph regularization parameters of the user graph and item graph from the target data, respectively. βu and βv are the graph regularization parameters of the user graph and item graph from the auxiliary data, respectively.

Finally, we impose the Frobenius norm on S, St and Sa to avoid overfitting. We end up with the following minimization problem for EKT

$$ \begin{array}{@{}rcl@{}} \underset{{U_{t}},{U_{a}},{V_{t}},{V_{a}},S,{S_{t}},{S_{a}} \ge 0}{\min} J&=&\left\| {{Y_{t}} \odot ({X_{t}} - {U_{t}}[S,{S_{t}}]{V_{t}}^{T})} \right\|_{F}^{2} \\ &+& \lambda \left\| {{Y_{a}} \odot ({X_{a}} - {U_{a}}[S,{S_{a}}]{V_{a}}^{T})} \right\|_{F}^{2} \\ &+& {\theta_{u}}\left\| {{U_{t}} - {U_{a}}} \right\|_{F}^{2} + {\theta_{v}}\left\| {{V_{t}} - {V_{a}}} \right\|_{F}^{2} \\ &+&{\alpha_{u}}tr({U_{t}^{T}}{{L_{U}^{t}}}{U_{t}}) + {\alpha_{v}}tr({V_{t}^{T}}{{L_{V}^{t}}}{V_{t}}) \\ &+&{\beta_{u}}tr({U_{a}^{T}}{{L_{U}^{a}}}{U_{a}}) + {\beta_{v}}tr({V_{a}^{T}}{{L_{V}^{a}}}{V_{a}}) \\ &+&{\alpha_{s}}\left\| {{S_{t}}} \right\|_{F}^{2} + {\beta_{s}}\left\| {{S_{a}}} \right\|_{F}^{2} + {\gamma_{s}}\left\| S \right\|_{F}^{2}, \end{array} $$
(11)

where αs, βs and γs are the regularization parameters for controlling the strength of regularization.

3.4 Learning the EKT

To facilitate the optimization of the EKT method, we rewrite (11) as

$$ \begin{array}{@{}rcl@{}} \underset{{U_{t}},{U_{a}},{V_{t}},{V_{a}},S,{S_{t}},{S_{a}} \ge 0}{\min} J &=&\left\| {{Y_{t}} \odot ({X_{t}} - {U_{t}}SV_{t_{0}}^{T} - {U_{t}}{S_{t}}{V_{t_{1}}^{T}})} \right\|_{F}^{2}\\ &+& \lambda \left\| {{Y_{a}} \odot ({X_{a}} - {U_{a}}SV_{a_{0}}^{T} - {U_{a}}{S_{a}}{V_{a_{1}}^{T}})} \right\|_{F}^{2} \\ &+& {\theta_{u}}\left\| {{U_{t}} - {U_{a}}} \right\|_{F}^{2} + {\theta_{v}}\left\| {{V_{t}} - {V_{a}}} \right\|_{F}^{2} \\ &+& {\alpha_{u}}tr({{U_{t}^{T}}}{{L_{U}^{t}}}{U_{t}}) + {\alpha_{v}}tr({{V_{t}^{T}}}{{L_{V}^{t}}}{V_{t}}) \\ &+& {\beta_{u}}tr({{U_{a}^{T}}}{{L_{U}^{a}}}{U_{a}}) + {\beta_{v}}tr({{V_{a}^{T}}}{{L_{V}^{a}}}{V_{a}}) \\ &+& {\alpha_{s}}\left\| {{S_{t}}} \right\|_{F}^{2} + {\beta_{s}}\left\| {{S_{a}}} \right\|_{F}^{2}{\text{ + }}{\gamma_{s}}\left\| S \right\|_{F}^{2}, \end{array} $$
(12)

where \({U_{t}}, {U_{a}} \in {\mathbb {R}^{M \times d_{1}}}\), \({V_{t}} = [{V_{{t_{0}}}}, {V_{{t_{1}}}}] \in {\mathbb {R}^{N \times d_{2}}}\), \({V_{a}} = [{V_{{a_{0}}}}, {V_{{a_{1}}}}] \in {\mathbb {R}^{N \times d_{2}}}\) , \(S \in {\mathbb {R}^{d_{1} \times c}}\), \({S_{t}} \in {\mathbb {R}^{d_{1} \times \left ({d_{2} - c} \right )}}\), \({S_{a}} \in {\mathbb {R}^{d_{1} \times \left ({d_{2} - c} \right )}}\). The optimization of our proposed EKT method can be solved by an alternating minimization algorithm. Specifically, we optimize a variable and compute its update rule while fixing the remaining variables. The procedure is repeated until convergence.

Learning S, Sa and St

Fix other variables to solve S, then we can rewrite the objective function in (12) as

$$ \begin{array}{@{}rcl@{}} \underset{S \geq 0}{\min} J(S) &=& \left\| {{Y_{t}} \odot ({X_{t}} - {U_{t}}SV_{{t_{0}}}^{T} - {U_{t}}{S_{t}}{V_{t_{1}}^{T}})} \right\|_{F}^{2} \\ &+& \lambda \left\| {{Y_{a}} \odot ({X_{a}} - {U_{a}}SV_{{a_{0}}}^{T} - {U_{a}}{S_{a}}{V_{{a_{1}}}^{T}})} \right\|_{F}^{2} + {\gamma_{s}}\left\| S \right\|_{F}^{2}. \\ \end{array} $$

The derivative of J(S) in regard to S is as follows

$$ \begin{array}{@{}rcl@{}} \frac{{\partial J(S)}}{{\partial S}} &=& 2{U_{t}^{T}}({Y_{t}} \odot (- {X_{t}} + {U_{t}}SV_{{t_{0}}}^{T} + {U_{t}}{S_{t}}V_{{t_{1}}}^{T})){V_{{t_{0}}}}\\ &+& 2\lambda {U_{a}^{T}}({Y_{a}} \odot (- {X_{a}} + {U_{a}}SV_{{a_{0}}}^{T} + {U_{a}}{S_{a}}V_{{a_{1}}}^{T})){V_{{a_{0}}}}{\text{ + }}2{\gamma_{s}}S. \end{array} $$

Utilizing the Karush-Kuhn-Tucker(KKT) complementary condition for the non-negativity of S and letting \(\frac {{\partial J(S)}}{{\partial S}}{\text { = }}0\), we can get

$$ \begin{array}{@{}rcl@{}} &&[2{U_{t}^{T}}({Y_{t}} \odot (- {X_{t}} + {U_{t}}SV_{{t_{0}}}^{T} + {U_{t}}{S_{t}}V_{{t_{1}}}^{T})){V_{{t_{0}}}} + 2{\gamma_{s}}S\\ && + 2\lambda {U_{a}^{T}}({Y_{a}} \odot (- {X_{a}} + {U_{a}}SV_{{a_{0}}}^{T} + {U_{a}}{S_{a}}V_{{a_{1}}}^{T})){V_{{a_{0}}}}] \odot S = 0. \end{array} $$

Then we can obtain the updating rule for S as follows

$$ S \leftarrow S \odot \frac{{\left[ {{{U_{t}^{T}}}\left( {{Y_{t}} \odot {X_{t}}} \right){V_{{t_{0}}}} + \lambda {{U_{a}^{T}}}({Y_{a}} \odot {X_{a}}){V_{{a_{0}}}}} \right]}}{{[A + B]}} $$
(13)
$$A = {U_{t}^{T}}({Y_{t}} \odot ({U_{t}}SV_{{t_{0}}}^{T} + {U_{t}}{S_{t}}V_{{t_{1}}}^{T})){V_{{t_{0}}}}$$
$$B = \lambda {U_{a}^{T}}({Y_{a}} \odot ({U_{a}}SV_{{a_{0}}}^{T} + {U_{a}}{S_{a}}V_{{a_{1}}}^{T})){V_{{a_{0}}}} + {\gamma_{s}}S,$$

where \(\frac {{[ \cdot ]}}{{[ \cdot ]}}\) denotes element-wise division. Similarly, we can obtain the following updating rules for learning St and Sa

$$ {S_{t}} \leftarrow {S_{t}} \odot \displaystyle\frac{{[{U_{t}^{T}}\left( {{Y_{t}} \odot {X_{t}}} \right){V_{{t_{1}}}}]}}{{[{U_{t}^{T}}({Y_{t}} \odot ({U_{t}}SV_{{t_{0}}}^{T} + {U_{t}}{S_{t}}V_{{t_{1}}}^{T})){V_{{t_{1}}}}{\text{ + }}{\alpha_{s}}{S_{t}}]}} $$
(14)
$$ {S_{a}} \leftarrow {S_{a}} \odot \displaystyle\frac{{[\lambda {U_{a}^{T}}({Y_{a}} \odot {X_{a}}){V_{{a_{1}}}}]}}{{[\lambda {U_{a}^{T}}({Y_{a}} \odot ({U_{a}}SV_{{a_{0}}}^{T} + {U_{a}}{S_{a}}V_{{a_{1}}}^{T})){V_{{a_{1}}}}{\text{ + }}{\beta_{s}}{S_{a}}]}}. $$
(15)
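A sketch of the multiplicative update (13), where V_t and V_a are split into their first c and remaining d2 − c columns as in (12); eps guards against division by zero and all names are illustrative:

```python
import numpy as np

def update_S(S, S_t, S_a, U_t, U_a, V_t, V_a, X_t, Y_t, X_a, Y_a,
             lam=1.0, gamma_s=1.0, eps=1e-12):
    """One multiplicative update of the shared rating pattern S, cf. (13)."""
    c = S.shape[1]                              # number of shared pattern columns
    V_t0, V_t1 = V_t[:, :c], V_t[:, c:]         # V_t = [V_t0, V_t1] as in (12)
    V_a0, V_a1 = V_a[:, :c], V_a[:, c:]
    num = (U_t.T @ (Y_t * X_t) @ V_t0
           + lam * U_a.T @ (Y_a * X_a) @ V_a0)
    den = (U_t.T @ (Y_t * (U_t @ S @ V_t0.T + U_t @ S_t @ V_t1.T)) @ V_t0
           + lam * U_a.T @ (Y_a * (U_a @ S @ V_a0.T + U_a @ S_a @ V_a1.T)) @ V_a0
           + gamma_s * S)
    return S * num / (den + eps)
```

The updates (14) and (15) for St and Sa follow the same numerator/denominator pattern with V_t1 and V_a1 in place of V_t0 and V_a0.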

Learning Ut, Ua, Vt and Va

Similarly, fixing the other variables to solve for Ut, we rewrite the objective function in (11) as

$$ \begin{array}{@{}rcl@{}} \underset{{U_{t}} \geq 0}{\min} J\left( {{U_{t}}} \right) &=& \left\| {{Y_{t}} \odot ({X_{t}} - {U_{t}}[S,{S_{t}}]{V_{t}}^{T})} \right\|_{F}^{2}\\ &+& {\theta_{u}}\left\| {{U_{t}} - {U_{a}}} \right\|_{F}^{2}+{\alpha_{u}}tr({U_{t}}^{T}{{L_{U}^{t}}}{U_{t}}). \end{array} $$

The derivative of J(Ut) with respect to Ut is as follows

$$ \displaystyle\frac{{\partial J\left( {{U_{t}}} \right)}}{{\partial {U_{t}}}} = 2({Y_{t}} \odot (- {X_{t}} + {U_{t}}[S,{S_{t}}]{V_{t}^{T}})){V_{t}}{[S,{S_{t}}]^{T}} +2{\theta_{u}}{(U_{t}-U_{a})}+2{\alpha_{u}}{{L_{U}^{t}}}{U_{t}}. $$

Using the Karush-Kuhn-Tucker(KKT) complementary condition for the nonnegativity of Ut and letting \(\frac {{\partial J({U_{t}})}}{{\partial {U_{t}}}}{\text { = }}0\), we can get

$$ [({Y_{t}} \odot (- {X_{t}} + {U_{t}}[S,{S_{t}}]{V_{t}^{T}})){V_{t}}{[S,{S_{t}}]^{T}} +{\theta_{u}}{(U_{t}-U_{a})}+{\alpha_{u}}{{L_{U}^{t}}}{U_{t}}] \odot {U_{t}} = 0. $$

Since \({{L_{U}^{t}}}\) may contain entries of both signs, we decompose it as \({{L_{U}^{t}}} = L_{U}^{t+} - L_{U}^{t-}\), where \(L_{U}^{t+}=\frac {1}{2}(|{{L_{U}^{t}}}|+{{L_{U}^{t}}})\) and \(L_{U}^{t-} = \frac {1}{2}\left ({\left | {{{L_{U}^{t}}}} \right | - {{L_{U}^{t}}}} \right )\); then

$$ \begin{array}{@{}rcl@{}} &&[({Y_{t}} \odot (- {X_{t}} + {U_{t}}[S,{S_{t}}]{V_{t}^{T}})){V_{t}}{[S,{S_{t}}]^{T}} +{\theta_{u}}{(U_{t}-U_{a})}+{\alpha_{u}}L_{U}^{t+}{U_{t}}\\ && - {\alpha_{u}}L_{U}^{t-} {U_{t}}] \odot {U_{t}} = 0. \end{array} $$

We obtain the updating rule for learning Ut as follows

$$ {U_{t}} \leftarrow {U_{t}} \odot\displaystyle\frac{{[\left( {{Y_{t}} \odot {X_{t}}} \right){V_{t}}{{[S,{S_{t}}]}^{T}} + {\theta_{u}}{U_{a}}+{\alpha_{u}}L_{U}^{t-}{U_{t}}]}}{{[\left( {{Y_{t}} \odot \left( {{U_{t}}[S,{S_{t}}]{{V_{t}^{T}}}} \right)} \right){V_{t}}{{[S,{S_{t}}]}^{T}} + {\theta_{u}}{U_{t}}+{\alpha_{u}}L_{U}^{t+}{U_{t}}]}}. $$
(16)

The latent factors Ua, Vt and Va can be learned in a similar way under the non-negativity constraints. The updating rules for Ua, Vt and Va are as follows:

$$ {U_{a}} \leftarrow {U_{a}} \odot\displaystyle\frac{{[\lambda \left( {{Y_{a}} \odot {X_{a}}} \right){V_{a}}{{[S,{S_{a}}]}^{T}} + {\theta_{u}}{U_{t}}+{\beta_{u}}L_{U}^{a-}{U_{a}}]}}{{[\lambda \left( {{Y_{a}} \odot \left( {{U_{a}}[S,{S_{a}}]{{V_{a}^{T}}}} \right)} \right){V_{a}}{{[S,{S_{a}}]}^{T}} + {\theta_{u}}{U_{a}}+{\beta_{u}}L_{U}^{a+}{U_{a}}]}} $$
(17)
$$ {V_t} \leftarrow {V_t} \odot \displaystyle\frac{{\left[ {{{\left( {{Y_t} \odot {X_t}} \right)}^T}{U_t}\left[ {S,{S_t}} \right] +{\theta_v}{V_a}+ {\alpha _v}L_{V}^{t- }{V_t}} \right]}}{{\left[ {{{\left( {{Y_t} \odot \left( {{U_t}\left[ {S,{S_t}} \right]V_t^T} \right)} \right)}^T}{U_t}\left[ {S,{S_t}} \right]{ + }{\theta_v}{V_t}+{\alpha _v}L_{V}^{t+ }{V_t}} \right]}} $$
(18)
$$ {V_a} \leftarrow {V_a} \odot \displaystyle\frac{{\left[ {{{\left( {{Y_a} \odot {X_a}} \right)}^T}{U_a}\left[ {S,{S_a}} \right] +{\theta_v}{V_t}+ {\beta _v}L_{V}^{a-}{V_a}} \right]}}{{\left[ {{{\left( {{Y_a} \odot \left( {{U_a}\left[ {S,{S_a}} \right]V_a^T} \right)} \right)}^T}{U_a}\left[ {S,{S_a}} \right]{ + }{\theta_v}{V_a}+{\beta _v}L_{V}^{a+}{V_a}} \right]}}, $$
(19)

where \({L_{U}^{a}}=L_{U}^{a+}-L_{U}^{a-} \), \(L_{U}^{a+}=\frac {1}{2}(|{L_{U}^{a}}|+{L_{U}^{a}})\), \(L_{U}^{a-}=\frac {1}{2}(|{L_{U}^{a}}|-{L_{U}^{a}})\), \({L_{V}^{t}}=L_{V}^{t+}-L_{V}^{t-} \), \(L_{V}^{t+}=\frac {1}{2}(|{L_{V}^{t}}|+{L_{V}^{t}})\), \(L_{V}^{t-}=\frac {1}{2}(|{L_{V}^{t}}|-{L_{V}^{t}})\), \({L_{V}^{a}}=L_{V}^{a+}-L_{V}^{a-} \), \(L_{V}^{a+}=\frac {1}{2}(|{L_{V}^{a}}|+{L_{V}^{a}})\), \(L_{V}^{a-}=\frac {1}{2}(|{L_{V}^{a}}|-{L_{V}^{a}})\). Note that \({V_{{t_{0}}}}{\text { = }}{V_{t}}(:,1:c)\), \({V_{{t_{1}}}}{\text { = }}{V_{t}}(:,c + 1:d_{2})\), \({V_{{a_{0}}}}{\text { = }}{V_{a}}(:,1:c)\), \({V_{{a_{1}}}}{\text { = }}{V_{a}}(:,c + 1:d_{2})\).
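The splitting of the graph Laplacians into nonnegative parts used in (16)-(19) can be written compactly as follows (a small helper under our naming; for L = D − W with a zero-diagonal, nonnegative W, L_plus equals D and L_minus equals W):

```python
import numpy as np

def split_laplacian(L):
    """Decompose a graph Laplacian into nonnegative parts so that L = L_plus - L_minus."""
    L_plus = 0.5 * (np.abs(L) + L)    # keeps the diagonal (degree) part
    L_minus = 0.5 * (np.abs(L) - L)   # keeps the off-diagonal (adjacency) part
    return L_plus, L_minus
```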

Theorem 1

Updating S, St, Sa, Ut, Ua, Vt and Va sequentially by (13)\(\sim \) (19) will monotonically decrease the objective function in (11) until convergence.

We will prove Theorem 1 in Section 3.5. The learning algorithm for EKT is summarized in Algorithm 1.

Algorithm 1 The learning algorithm for EKT
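Since the pseudocode figure is not reproduced here, the following self-contained sketch outlines the alternating optimization loop as we read (13)-(19); the initialization, parameter defaults, and the refresh of the masked reconstructions once per block of updates are our simplifications rather than the authors' implementation:

```python
import numpy as np

def split_laplacian(L):
    """L = L_plus - L_minus with both parts nonnegative."""
    return 0.5 * (np.abs(L) + L), 0.5 * (np.abs(L) - L)

def learn_ekt(X_t, Y_t, X_a, Y_a, L_Ut, L_Ua, L_Vt, L_Va, d1, d2, c,
              lam=1.0, theta_u=0.5, theta_v=0.5, alpha_u=0.1, alpha_v=0.1,
              beta_u=0.1, beta_v=0.1, alpha_s=1.0, beta_s=1.0, gamma_s=1.0,
              n_iter=200, eps=1e-12, seed=0):
    """Alternating multiplicative updates sketching (13)-(19)."""
    rng = np.random.default_rng(seed)
    M, N = X_t.shape
    U_t, U_a = rng.random((M, d1)), rng.random((M, d1))
    V_t, V_a = rng.random((N, d2)), rng.random((N, d2))
    S = rng.random((d1, c))
    S_t, S_a = rng.random((d1, d2 - c)), rng.random((d1, d2 - c))
    LUt_p, LUt_m = split_laplacian(L_Ut)
    LUa_p, LUa_m = split_laplacian(L_Ua)
    LVt_p, LVt_m = split_laplacian(L_Vt)
    LVa_p, LVa_m = split_laplacian(L_Va)

    for _ in range(n_iter):
        # For compactness the masked reconstructions are refreshed once per block of
        # updates; a strictly sequential scheme would refresh them after every update.
        Vt0, Vt1, Va0, Va1 = V_t[:, :c], V_t[:, c:], V_a[:, :c], V_a[:, c:]
        Rt = Y_t * (U_t @ S @ Vt0.T + U_t @ S_t @ Vt1.T)
        Ra = Y_a * (U_a @ S @ Va0.T + U_a @ S_a @ Va1.T)
        # Rating-pattern updates, cf. (13)-(15).
        S = S * (U_t.T @ (Y_t * X_t) @ Vt0 + lam * U_a.T @ (Y_a * X_a) @ Va0) \
              / (U_t.T @ Rt @ Vt0 + lam * U_a.T @ Ra @ Va0 + gamma_s * S + eps)
        S_t = S_t * (U_t.T @ (Y_t * X_t) @ Vt1) / (U_t.T @ Rt @ Vt1 + alpha_s * S_t + eps)
        S_a = S_a * (lam * U_a.T @ (Y_a * X_a) @ Va1) / (lam * U_a.T @ Ra @ Va1 + beta_s * S_a + eps)
        # Latent-factor updates, cf. (16)-(19); lam is kept at 1 as in the experiments.
        B_t, B_a = np.hstack([S, S_t]), np.hstack([S, S_a])
        Rt, Ra = Y_t * (U_t @ B_t @ V_t.T), Y_a * (U_a @ B_a @ V_a.T)
        U_t = U_t * ((Y_t * X_t) @ V_t @ B_t.T + theta_u * U_a + alpha_u * LUt_m @ U_t) \
                  / (Rt @ V_t @ B_t.T + theta_u * U_t + alpha_u * LUt_p @ U_t + eps)
        U_a = U_a * (lam * (Y_a * X_a) @ V_a @ B_a.T + theta_u * U_t + beta_u * LUa_m @ U_a) \
                  / (lam * Ra @ V_a @ B_a.T + theta_u * U_a + beta_u * LUa_p @ U_a + eps)
        V_t = V_t * ((Y_t * X_t).T @ U_t @ B_t + theta_v * V_a + alpha_v * LVt_m @ V_t) \
                  / (Rt.T @ U_t @ B_t + theta_v * V_t + alpha_v * LVt_p @ V_t + eps)
        V_a = V_a * ((Y_a * X_a).T @ U_a @ B_a + theta_v * V_t + beta_v * LVa_m @ V_a) \
                  / (Ra.T @ U_a @ B_a + theta_v * V_a + beta_v * LVa_p @ V_a + eps)

    return U_t @ np.hstack([S, S_t]) @ V_t.T   # predicted (normalized) target ratings
```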

3.5 Convergence analysis

We adopt the auxiliary function approach [4, 16, 21] to prove Theorem 1. For simplicity, we only prove that the objective function J in (11) decreases monotonically under the update rule for Ut in (16); the convergence of the other update rules can be proved in a similar way. The definition of an auxiliary function is as follows.

Definition 1

[16] \(Q(p,p^{\prime })\) is an auxiliary function for F(p) if the conditions

$$Q(p,p^{\prime}) \ge F(p) \quad \text{and} \quad Q(p,p) = F(p)$$

are satisfied for any given p, \(p^{\prime }\).

Lemma 1

[16] If Q is an auxiliary function for F, then F is decreasing under the update

$$ {p^{\left( {k + 1} \right)}} = \arg {\min_{p}}Q\left( {p,{p^{\left( k \right)}}} \right). $$
(20)

Proof

$$F({p^{\left( {k + 1} \right)}}) \le Q({p^{\left( {k + 1} \right)}},{p^{\left( k \right)}}) \le Q({p^{\left( k \right)}},{p^{\left( k \right)}}) = F({p^{\left( k \right)}}).$$

Next, by constructing an appropriate auxiliary function, we will demonstrate that (16) is exactly the update rule in Lemma 1. For any element uij in Ut, let Fij denote the part of J that is relevant only to uij. We compute the first and second derivatives of Fij with respect to uij as follows

$$ \begin{array}{@{}rcl@{}} F^{\prime}_{ij} &=& \left[ {{\text{2}}\left( {{Y_{t}} \odot \left( { - {X_{t}} + {U_{t}}[S,{S_{t}}]{V_{t}}^{T}} \right)} \right){V_{t}}{{[S,{S_{t}}]}^{T}}} \right.\\ &&{\left. {{\text{ + }}2{\theta_{u}}{(U_{t}-U_{a})}+2{\alpha_{u}}L_{U}^{t+}{U_{t}}-2{\alpha_{u}}L_{U}^{t-}{U_{t}}} \right]_{ij}} \end{array} $$
$$ F^{\prime\prime}_{ij} = 2{\left( {{Y_{t}}} \right)_{ij}}{\left( {\left( {[S,{S_{t}}]{V_{t}}^{T}} \right){V_{t}}{{[S,{S_{t}}]}^{T}}} \right)_{jj}} + 2{\theta_{u}} + 2{\alpha_{u}}{\left( {L_{U}^{t+} } \right)_{ii}}- 2{\alpha_{u}}{\left( {L_{U}^{t-} } \right)_{ii}}. $$

Lemma 2

Function

$$ \begin{array}{@{}rcl@{}} Q(u,{u_{ij}^{(k)}}) &=& {F_{ij}}(u_{ij}^{(k)}) + F^{\prime}_{ij}(u_{ij}^{(k)})(u - u_{ij}^{(k)}) \\ &+&\frac{{{{\left( {2\left( {{Y_{t}} \odot \left( {{U_{t}}[S,{S_{t}}]{V_{t}}^{T}} \right)} \right){V_{t}}{{[S,{S_{t}}]}^{T}} + 2{\theta_{u}}{U_{t}}+2{\alpha_{u}}L_{U}^{t+}{U_{t}}} \right)}_{ij}} }}{{u_{ij}^{(k)}}}\cdot{(u - u_{ij}^{(k)})^{2}} \end{array} $$
(21)

is an appropriate auxiliary function for Fij(u).

Proof

It is straightforward that \(Q\left ({u,u} \right ) = {F_{ij}}\left (u \right )\), and hence we only need verify that \(Q\left ({u,u_{ij}^{\left (k \right )}} \right ) \ge {F_{ij}}\left (u \right )\). To achieve this, we use Taylor series to expand \({F_{ij}}\left (u \right )\)

$$ \begin{array}{@{}rcl@{}} {F_{ij}}\left( u \right) &=& {F_{ij}}\left( {u_{ij}^{\left( k \right)}} \right) + F^{\prime}_{ij}\left( {u_{ij}^{\left( k \right)}} \right)\left( {u - u_{ij}^{\left( k \right)}} \right)\\ &+& \left( {2{{\left( {{Y_{t}}} \right)}_{ij}}{{\left( {\left( {[S,{S_{t}}]{V_{t}}^{T}} \right){V_{t}}{{[S,{S_{t}}]}^{T}}} \right)}_{jj}}} +2{\theta_{u}}\right.\\ &+& 2{\alpha_{u}}{\left( {L_{U}^{t+} } \right)_{ii}} \left. { - 2{\alpha_{u}}{{\left( {L_{U}^{t-} } \right)}_{ii}}} \right) \cdot {\left( {u - u_{ij}^{\left( k \right)}} \right)^{2}}. \end{array} $$

Through algebraic operations, we can get the following inequality:

$${\left( {L_{U}^{t+} {U_{t}}} \right)_{ij}} = \sum\limits_{q} {u_{qj}^{\left( k \right)}} {\left( {L_{U}^{t+} } \right)_{iq}} \ge u_{ij}^{\left( k \right)}{\left( {L_{U}^{t+} } \right)_{ii}}.$$

Combining the Taylor expansion of Fij(u) with the above inequality and comparing the quadratic coefficients with those in (21), we have

$$ \begin{array}{@{}rcl@{}} \displaystyle\frac{{{{\left( {2\left( {{Y_t} \odot \left( {{U_t}[S,{S_t}]V_t^T} \right)} \right){V_t}{{[S,{S_t}]}^T} + 2{\theta_u}{U_t}+2{\alpha _u}L_{U}^{t+} {U_t}} \right)}_{ij}} }}{{u_{ij}^{\left( k \right)}}} \\ \ge 2{\left( {{Y_t}} \right)_{ij}}{\left( {\left( {\left[ {S,{S_t}} \right]V_t^T} \right){V_t}{{\left[ {S,{S_t}} \right]}^T}} \right)_{ij}}+2{\theta_u}+ 2{\alpha _u}{\left( {L_{U}^{t+} } \right)_{ii}} - 2{\alpha _u}{\left( {L_{U}^{t-} } \right)_{ii}}. \\ \end{array} $$

Further, we can get \(Q\left ({u,u_{ij}^{\left (k \right )}} \right ) \ge {F_{ij}}\left (u \right )\), and Lemma 2 holds. □

Proof of Theorem 1

According to Lemmas 1 and 2, we can obtain the update rule for Ut by minimizing \(Q\left ({u_{ij}^{\left ({k + 1} \right )}, u_{ij}^{\left (k \right )}} \right )\). Setting \(\frac {{\partial Q\left ({u_{ij}^{\left ({k + 1} \right )}, u_{ij}^{\left (k \right )}} \right )}}{{\partial u_{ij}^{\left ({k + 1} \right )}}} = 0\), we get

$$ \begin{array}{@{}rcl@{}} u_{ij}^{\left( {k + 1} \right)} &=& u_{ij}^{\left( k \right)}-\displaystyle\frac{{u_{ij}^{\left( k \right)}F^{\prime}_{ij}\left( {u_{ij}^{\left( k \right)}} \right)}}{{{{\left( {2\left( {{Y_{t}} \odot \left( {{U_{t}}[S,{S_{t}}]{V_{t}^{T}}} \right)} \right){V_{t}}{{[S,{S_{t}}]}^{T}} + 2{\theta_{u}}{U_{t}}+2{\alpha_{u}}L_{U}^{t+} {U_{t}}} \right)}_{ij}}}} \\ &=&u_{ij}^{\left( k \right)}\displaystyle\frac{{{{\left( {\left( {{Y_{t}} \odot {X_{t}}} \right){V_{t}}{{[S,{S_{t}}]}^{T}} + {\theta_{u}}{U_{a}}+{\alpha_{u}}L_{U}^{t-} {U_{t}}} \right)}_{ij}}}}{{{{\left( {\left( {{Y_{t}} \odot \left( {{U_{t}}[S,{S_{t}}]{V_{t}}^{T}} \right)} \right){V_{t}}{{[S,{S_{t}}]}^{T}} + {\theta_{u}}{U_{t}}+{\alpha_{u}}L_{U}^{t+} {U_{t}}} \right)}_{ij}}}}. \end{array} $$

The above update rule is identical to (16). For each iteration of the update, we obtain

$$ \begin{array}{@{}rcl@{}} J\left( {U_{t}^{\left( 0 \right)}} \right) &=& Q\left( {U_{t}^{\left( 0 \right)},U_{t}^{\left( 0 \right)}} \right) \ge Q\left( {U_{t}^{\left( 1 \right)},U_{t}^{\left( 0 \right)}} \right) \\ &\ge& Q\left( {U_{t}^{\left( 1 \right)},U_{t}^{\left( 1 \right)}} \right) = J\left( {U_{t}^{\left( 1 \right)}} \right) \ge \cdot \cdot \cdot \ge J\left( {U_{t}^{\left( T \right)}} \right). \end{array} $$

Therefore, J(Ut) is monotonically decreasing during the iterations. Since the objective function in (11) is bounded below, the convergence stated in Theorem 1 follows. □

3.6 Analysis

When 𝜃u = 𝜃v = 0, αu = αv = 0 and βu = βv = 0, EKT reduces to CLFM [10], which involves neither the similarity constraints on the latent factors nor the graph co-regularization of user and item graphs. When c = 0, 𝜃u = 𝜃v = 0 and βu = βv = 0, EKT reduces to GWNMTF [11], which learns the latent factors of users and items and the user-item rating pattern using only the target rating data. When c = 0, 𝜃u = 𝜃v = 0, βu = βv = 0 and αu = αv = 0, EKT reduces to WNMTF, which does not involve the user and item graphs. Therefore, EKT is generic and subsumes CLFM, GWNMTF and WNMTF as special cases.

4 Experiments

4.1 Data sets and evaluation metrics

4.1.1 Data sets

We evaluate our proposed method using subsets of two movie rating data sets, MovieLens10M (denoted as ML10M) and Netflix. The ML10M data contains more than 10⁷ ratings with values in {0.5,1,1.5,2,2.5,3,3.5,4,4.5,5}, given by more than 7.1 × 10⁴ users on around 1.1 × 10⁴ movies. Following the experimental protocol of Pan et al. [31], we randomly extract a 5000 × 5000 dense rating matrix R from the ML10M data and then randomly split R into a training set RT and a test set RE, each with 50% of the ratings. RE remains unchanged, while target training data Xt at the sparsity levels 0.01%, 0.05%, 0.1%, 0.5% and 1% are constructed by randomly sampling the corresponding numbers of observed ratings (2500, 12500, 25000, 125000 and 250000) from RT. The auxiliary data Xa is constructed by randomly sampling 100 observed ratings on average per user from RT. A pre-processing approach [34] is adopted that relabels ratings with value less than 4 in Xa as 0 (dislike) and ratings with value greater than or equal to 4 as 1 (like). The overlaps between Xt and Xa are 0.0015%, 0.0075%, 0.015%, 0.075% and 0.15%, respectively.
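A sketch of the binarization step applied to the auxiliary data, under the thresholding described above (the function name and threshold argument are ours):

```python
import numpy as np

def binarize_auxiliary(X, Y, threshold=4.0):
    """Relabel observed numerical ratings as binary preferences:
    >= threshold -> 1 (like), < threshold -> 0 (dislike). Y marks observed entries."""
    return np.where(X >= threshold, 1.0, 0.0) * Y
```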

The Netflix rating data contains more than 10⁸ ratings with values in {1,2,3,4,5}, given by more than 4.8 × 10⁵ users on around 1.8 × 10⁴ movies. The data set employed in the experiments is constructed in the same way as for the ML10M data. The final data sets are summarized in Table 2.

Table 2 Description of subset of ML10M data (M=N = 5000) and subset of Netflix data (M=N = 5000)

4.1.2 Evaluation metrics

We employed the Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) as the evaluation metrics

$$ MAE = \sum\limits_{(u,i,{x_{ui}})\in {R_{E}}} {\left| {{x_{ui}} - {{\hat x}_{ui}}} \right|/\left| {{R_{E}}} \right|} $$
(22)
$$ RMSE = \sqrt {\sum\limits_{(u,i,{x_{ui}}) \in {R_{E}}} {{{\left( {{x_{ui}} - {{\hat x}_{ui}}} \right)}^{2}}/\left| {{R_{E}}} \right|} }, $$
(23)

where \(\hat {x}_{ui}\) and xui represent the predicted and true ratings, respectively, and |RE| represents the number of test ratings. For each required number of observed ratings generated from RT, we run 5 randomized trials and report the average results.
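A minimal sketch of the two metrics in (22)-(23) over the test set RE (names are illustrative):

```python
import numpy as np

def mae_rmse(x_true, x_pred):
    """MAE and RMSE over the test ratings, cf. (22)-(23)."""
    err = x_true - x_pred
    return np.mean(np.abs(err)), np.sqrt(np.mean(err ** 2))

# x_true and x_pred are 1-D arrays aligned with the (u, i, x_ui) triples in R_E.
mae, rmse = mae_rmse(np.array([4.0, 2.0, 5.0]), np.array([3.5, 2.5, 4.0]))
```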

4.2 Baselines and parameter settings

4.2.1 Baselines

In our experiments, we compare our proposed EKT with the following seven closely related baseline algorithms:

  • WNMTF (Weighted Nonnegative Matrix Tri-Factorization) is a single-domain recommendation method that decomposes the target user-item rating matrix into the product of three low-rank non-negative matrices, which are then used to predict ratings and provide recommendations.

  • GWNMTF (Graph regularized Weighted Nonnegative Matrix Tri-Factorization) [11] is a non-transfer learning method based on graph regularization. In this model, two graphs are constructed on the user side and the item side to exploit both internal and external information.

  • PMF (Probabilistic Matrix Factorization) [24] adopts a probabilistic model with Gaussian observation noise and learns the latent features of users and items by maximizing the conditional distribution of the latent feature matrices over the observed target ratings. It learns on the target domain only.

  • TCF (Transfer by Collective Factorization) [27, 31] is a transfer learning method that uses binary auxiliary data. In this model, the shared latent spaces U and V are constructed in a collective manner, and the data-dependent effect is captured by learning the inner matrices B and \(\tilde {B}\) separately. TCF has two variants, namely CMTF (Collective Matrix Tri-Factorization) and CSVD (Collective SVD).

  • iTCF (Interaction-rich Transfer by Collective Factorization) [28] is an efficient transfer learning algorithm for collaborative filtering with heterogeneous user feedback. In iTCF, richer interactions are introduced by sharing both the item-specific latent features and the predictability of the two heterogeneous data sources in a smooth manner.

  • TMF (Transfer by Mixed Factorization) [29] is a generic mixed-factorization-based transfer learning framework for collaborative recommendation with heterogeneous explicit feedback. TMF unifies two transfer methods in one optimization framework: instance-based transfer by integrative factorization and feature-based transfer by collective factorization.

  • WNMTF-TL (Weighted Nonnegative Matrix Tri-Factorization based on Transfer Learning). The Active Transfer Learning for Cross-System Recommendation method [46] constructs cross-domain entity correspondences and plugs the actively constructed correspondences into a general matrix factorization model. In our problem formulation, the cross-domain entities are in one-to-one correspondence, so we remove the active learning module from the originally proposed method. In addition, for a fairer comparison with our EKT method, we adopt WNMTF as the matrix factorization model and use the entity similarity learned from the auxiliary binary rating data as a prior to constrain the entity similarity in the target domain. We name this method WNMTF-TL.

Our EKT (Enhanced Knowledge Transfer for Collaborative Filtering with Multi-Source Heterogeneous Feedbacks) method seamlessly integrates the weighted collective matrix tri-factorization and graph co-regularization of user and item graphs into a unified framework that can simultaneously enhance knowledge transfer and alleviate negative transfer.

4.2.2 Parameter settings

For all methods, different numbers of latent factors d1,d2 ∈{5,10,15,20} are tried. For EKT, different numbers of shared latent user-item rating patterns c ∈{0,1,⋯ ,d2} are tried. Note that when c = 0, there is no shared user-item rating pattern between the auxiliary data and the target data, and when c = d2, the latent user-item rating pattern is fully shared between the auxiliary data and the target data. For GWNMTF, different trade-off parameters λ = μ ∈{0.01,0.1,1,10,100} are tried. For PMF, different trade-off parameters λU = λV ∈{0.01,0.1,1} are tried. For TCF (CMTF), β is fixed as 1, and different trade-off parameters αu = αv ∈{0.01,0.1,1}, λ ∈{0.01,0.1,1} are tried. For TCF (CSVD), different trade-off parameters λ ∈{0.01,0.1,1} are tried. For iTCF, we fix the trade-off parameters λ = 1, ρ = 0.5, and try different trade-off parameters αu = αv ∈{0.01,0.1,1}, βu = βv ∈{0.01,0.1,1}. For TMF, we fix the trade-off parameters λ = 1, ρ = 0.5, δP = δN = 1, wP = 2, wN = 1, and try different trade-off parameters αu = αv ∈{0.01,0.1,1}, βu = βv ∈{0.01,0.1,1,10,100}. For WNMTF-TL, different trade-off parameters \(\lambda _{\mathcal {C}}\in \{0.01,0.1,1,10,100\}\) are tried. For EKT, λ, αs, βs and γs are fixed as 1, and different trade-off parameters αu = αv ∈{0.01,0.1,1}, βu = βv ∈{0.01,0.1,1}, 𝜃u = 𝜃v ∈{0.1,0.5,1,5,10} are tried. Further analysis of the parameters is provided in Section 4.5. Note that for EKT, to alleviate the heterogeneity between the target data and the auxiliary data, the target rating matrices from ML10M and Netflix used for training are preprocessed as (Xt)ui = ((Xt)ui − 0.5)/4.5 and (Xt)ui = ((Xt)ui − 1)/4, respectively. For iTCF and TMF, following Pan et al. [28, 29], “0 (dislike)” and “1 (like)” in the auxiliary binary rating data are replaced with the numerical values “1 (dislike)” and “5 (like)”, respectively. For all baseline methods and our EKT method, each is run for 1000 iterations and the best results are reported.
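The normalization of the target ratings mentioned above can be sketched as follows (the helper and its arguments are ours; r_min = 0.5, r_max = 5 reproduces (x − 0.5)/4.5 for ML10M, and r_min = 1, r_max = 5 reproduces (x − 1)/4 for Netflix):

```python
import numpy as np

def normalize_target(X_t, Y_t, r_min=0.5, r_max=5.0):
    """Map observed target ratings onto [0, 1] via (x - r_min) / (r_max - r_min)."""
    return ((X_t - r_min) / (r_max - r_min)) * Y_t   # Y_t keeps missing entries at 0
```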

4.3 Experimental results

The experimental results on ML10M and Netflix are shown in Tables 3 and 4, respectively. From these results, we can make the following observations:

  1. (1)

    For the non-transfer learning methods, we can see that GWNMTF consistently outperforms WNMTF at all sparsity levels. In addition, GWNMTF is better than PMF when the sparsity level is higher (e.g., ≥ 0.5% for ML10M and ≥ 0.1% for Netflix). However, when the sparsity level is lower (e.g., ≤ 0.1% for ML10M and ≤ 0.05% for Netflix), the prediction performance of GWNMTF is worse than that of PMF. The reason is that when the target rating matrix is denser, the extracted neighborhood structure information between entities (users or items) is more accurate; conversely, when the target data is sparser, the extracted neighborhood structure information may be inaccurate.

  2. (2)

    Transfer learning technology is an effective method to solve the sparsity issue in collaborative filtering.

    1. (a)

      We can see that the transfer-based methods consistently outperform the non-transfer methods at all sparsity levels, which demonstrates the effectiveness of TL-based CF methods.

    2. (b)

      The proposed EKT method performs better than all baseline methods at almost all sparsity levels except the denser 1% case, and the average prediction performance of our EKT method is significantly better than that of all baseline methods. In addition, it is interesting to see that the sparser the target data, the larger the performance improvement of our EKT method over the baseline methods. The reason is that when the target data is extremely sparse, the other transfer learning baselines are likely to encounter under-transfer and negative transfer issues, resulting in a rapid degradation of performance. Our EKT method considers these two issues simultaneously in a unified framework, which not only transfers more knowledge from the auxiliary data but also effectively alleviates the negative transfer issue.

    3. (c)

      CSVD performs better than CMTF at all sparsity levels of Netflix, which is consistent with the results reported in [31]. In addition, when the target data sparsity level is 1%, CSVD achieves slightly better prediction performance than our EKT method, indicating that when the target data is not very sparse, the orthogonal constraint in CSVD can reduce noise and thereby improve the prediction performance. However, on ML10M, when the sparsity level is less than or equal to 0.1%, CSVD performs worse than CMTF, which shows that at lower sparsity levels CSVD may be unstable and positive transfer may not be guaranteed.

    4. (d)

      TMF performs better than iTCF at all sparsity levels, which is consistent with the results reported in [29]. In addition, at the 1% sparsity level of ML10M, TMF achieves slightly better prediction performance than our EKT method. The reason is that TMF builds a virtual user profile from the target data by combining the latent features of the items a user liked and disliked; therefore, when the target data is relatively denser, TMF can perform better.

    5. (e)

      TMF performs better than CSVD at almost all sparsity levels except the denser 1% and 0.5% cases on Netflix. In particular, TMF outperforms CSVD when the sparsity level is lower, because TMF introduces the global average preference score, the user preference biases and the item preference biases into its prediction rule, which is helpful for an extremely sparse rating matrix.

    6. (f)

      WNMTF-TL is significantly better than the non-transfer learning methods WNMTF and GWNMTF, which shows that transferring the similarity between entities estimated in the auxiliary data to the target data can improve the prediction performance of WNMTF and GWNMTF. However, EKT performs better than WNMTF-TL in all cases. A possible reason is that when the target data is extremely sparse, the collective knowledge transfer strategy used by EKT performs better than the adaptive knowledge transfer strategy used by WNMTF-TL, since collective behavior can introduce richer interactions when bridging two data sources [31].

Table 3 Prediction performance comparison on the subset of MovieLens10M data. Numbers in boldface are the best results among all methods
Table 4 Prediction performance comparison on the subset of Netflix data. Numbers in boldface are the best results among all methods

4.4 Complexity analysis

The total time complexity of the proposed EKT method is O(K(MNd1 + MNd2 + (M + N)d1d2 + M²d1 + N²d2) + M²N + N²M) if the time cost of the similarity graph construction is taken into account; this cost can be greatly reduced if the input data is sparse. We also list in Table 5 the time taken by each method to complete the task at the 0.1% target data sparsity level on the Netflix dataset. This experiment is conducted on a computer with 16 GB of memory and a 1.6-GHz Intel Core i5 processor. From Table 5, we can observe that the non-transfer methods are faster, because they are generally simpler and do not need to consider the auxiliary data. Among the cross-domain methods, WNMTF-TL is the slowest, because it needs to solve two objective functions, and it also needs to calculate the similarity matrices of entities (users/items).
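To illustrate the similarity graph construction whose O(M²N + N²M) cost is mentioned above, the sketch below builds user-user and item-item k-nearest-neighbor graphs from a sparse M × N rating matrix. The choice of cosine similarity, the value of k, and the helper name similarity_graphs are assumptions made for illustration; the paper does not prescribe this particular construction, but the sparse matrix products make clear why sparse input greatly reduces the cost.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import normalize

def similarity_graphs(R, k=10):
    """Build user-user and item-item k-NN similarity graphs from a sparse
    M x N rating matrix R (scipy.sparse). With dense input the cost is
    O(M^2 N) + O(N^2 M); sparse products only touch observed entries."""
    Rn_u = normalize(R, axis=1)          # row-normalize users (L2)
    Rn_i = normalize(R.T.tocsr(), axis=1)  # row-normalize items (L2)
    S_u = (Rn_u @ Rn_u.T).toarray()      # user-user cosine similarities
    S_i = (Rn_i @ Rn_i.T).toarray()      # item-item cosine similarities
    np.fill_diagonal(S_u, 0.0)
    np.fill_diagonal(S_i, 0.0)
    # keep only the k largest similarities per row (simple k-NN sparsification)
    for S in (S_u, S_i):
        idx = np.argsort(S, axis=1)[:, :-k]
        np.put_along_axis(S, idx, 0.0, axis=1)
    return S_u, S_i
```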

Table 5 Elapsed time comparison on the subset of Netflix

4.5 Parameter analysis

In this section, we test how the parameters affect the performance of EKT. There are 13 parameters in the proposed EKT: λ, d1, d2, c, αu, αv, βu, βv, 𝜃u, 𝜃v, αs, βs and γs. λ is a trade-off parameter that balances the target data and the auxiliary data. d1 and d2 are the numbers of latent features of users and items, respectively. c is the number of shared rating patterns. αu and αv are the graph regularization parameters of the target data. βu and βv are the graph regularization parameters of the auxiliary data. αs, βs and γs are the regularization parameters. To simplify the problem, we fix λ = 1, αs = βs = γs = 1 and d1 = d2 = 10, and let αu = αv = α, βu = βv = β, 𝜃u = 𝜃v = 𝜃, focusing only on how the parameters c, α, β and 𝜃 affect the performance of EKT. For simplicity, only the results on the subset of Netflix are included. Data sets with five sparsity levels are used to test the four parameters. MAE and RMSE are used as evaluation metrics; since the MAE and RMSE results are similar, we only report RMSE.

For the parameter c, we search over the grid {1,2,3,4,5,6,7,8,9,10} to find the best value. For the parameters α and β, we search over the grid {0.01,0.05,0.1,0.5,1}. For the parameter 𝜃, we search over the grid {0.1,0.5,1,5,10}. The best settings of these four parameters at different sparsity levels are listed in Table 6.

To analyze the parameter c, we fix α, β and 𝜃 to the best values for each sparsity level, as shown in Table 6. From Fig. 3a, we can see that for target data with different sparsity levels, setting a proper c value helps improve the prediction accuracy. For target data with sparsity levels of 1% and 0.5%, the highest accuracy is achieved by setting c = 4, and for target data with sparsity levels of 0.1%, 0.05% and 0.01%, the highest accuracy is obtained by setting c = 1. That is, when the target data is sparser, fewer rating patterns are shared by the auxiliary data and the target data, so c should be set smaller to avoid negative transfer.

Likewise, to analyze the parameter α, β or 𝜃, we fix the remaining parameters to the best values for each sparsity level, as shown in Table 6. From Fig. 3b and c, we can observe that when the target data sparsity level is higher than 0.1%, better performance can be achieved by setting proper values of α (α = 0.01) and β (β = 0.01). However, when the target data sparsity level is equal to or lower than 0.1%, α has little effect on the performance, while β has a greater impact on the prediction performance. The reason is that when the target data is extremely sparse, the neighborhood structure information in the target data is difficult to estimate accurately; in this case, setting a proper β (β = 0.1) still improves the prediction performance. A reasonable explanation is that the neighborhood structure information of the auxiliary data is transferred to the target data to help refine the latent factors of the target data.

From Fig. 3d, we can see that the sparser the target data is, the larger the 𝜃 required to obtain better prediction performance. The reason is that the sparser the target data is, the more knowledge needs to be transferred from the auxiliary data. However, choosing a larger 𝜃 increases the running time of the algorithm. To trade off between an acceptable running time and relatively high prediction performance, we set 𝜃 = 5 (sparsity ≥ 0.05%) and 𝜃 = 10 (sparsity = 0.01%) in our experiments.
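For reference, a minimal grid-search sketch over c, α, β and 𝜃 is given below, with the remaining parameters fixed as described above. For simplicity it enumerates the full grid, whereas the analysis above varies one parameter at a time while fixing the others at their best values from Table 6; train_ekt, evaluate_rmse, X_target, X_aux and X_test are hypothetical placeholders for the EKT training and evaluation code, which is not shown here.

```python
from itertools import product

# Parameter grids taken from the text; the remaining parameters are fixed.
c_grid     = range(1, 11)                 # {1, ..., 10}
alpha_grid = [0.01, 0.05, 0.1, 0.5, 1]
beta_grid  = [0.01, 0.05, 0.1, 0.5, 1]
theta_grid = [0.1, 0.5, 1, 5, 10]

fixed = dict(lam=1.0, alpha_s=1.0, beta_s=1.0, gamma_s=1.0, d1=10, d2=10)

best = (float("inf"), None)
for c, alpha, beta, theta in product(c_grid, alpha_grid, beta_grid, theta_grid):
    # train_ekt and evaluate_rmse are hypothetical helpers standing in for
    # the EKT training/evaluation routines described in the paper.
    model = train_ekt(X_target, X_aux, c=c, alpha_u=alpha, alpha_v=alpha,
                      beta_u=beta, beta_v=beta, theta_u=theta, theta_v=theta,
                      **fixed)
    rmse = evaluate_rmse(model, X_test)
    if rmse < best[0]:
        best = (rmse, dict(c=c, alpha=alpha, beta=beta, theta=theta))

print("best RMSE %.4f with %s" % best)
```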

Table 6 The best parameter setting of c, α, β and 𝜃 at different sparsity levels
Fig. 3
figure 3

RMSE results with different parameter settings on the subset of Netflix

5 Conclusions and future work

We proposed an Enhanced Knowledge Transfer for Collaborative Filtering with Multi-Source Heterogeneous Feedbacks (EKT) method, which transfers more useful knowledge from auxiliary data with explicit binary ratings to reduce the sparsity of the target numerical rating data. By sharing latent user preferences, latent item features and partial user-item rating patterns between the target data and the auxiliary data, the EKT method achieves more complete knowledge transfer, while alleviating the negative transfer issue by integrating graph co-regularization of the user and item graphs into the weighted collective matrix tri-factorization. Experimental results on two benchmark datasets verify the effectiveness of the proposed EKT method. Even when the target data is extremely sparse, our EKT method can still achieve relatively good prediction performance.

For future work, our main interest is to extend our EKT method to heterogeneous feedback scenarios with multiple heterogeneous auxiliary sources.