1 Introduction

In the era of information overload, recommender systems are indispensable tools that are widely used in a variety of online services, including e-commerce, short videos, and social networking sites. The key to personalized recommendation is to screen out items that a user may like according to their past interactions, which is called collaborative filtering [1,2,3,4]. Collaborative filtering is the most influential and widely used family of models in the field of recommender systems. Among collaborative filtering models, matrix factorization generalizes best and copes well with sparse rating matrices. As shown in Fig. 1, taking movie recommendation as an example, matrix factorization [5, 6] projects users and items into a shared latent space, in which the recommender system predicts a personalized ranking over a set of items for each user based on the similarities between users and items. The recommendation task here is to use the user's history to predict the ratings of unwatched movies. The history of interactions between users and items contains two kinds of behavior: explicit feedback and implicit feedback. Explicit feedback includes the user's ratings of or reviews on an item, which directly express the user's preferences. Implicit feedback includes the user's purchases, clicks, collections, etc., which do not reflect the user's preferences directly but can be used to mine them.

Fig. 1 Collaborative filtering for recommender systems. \({R_{M \times N}}\) represents M users' rating matrix for N movies, and \({U_{M \times K}}\) and \({I_{K \times N}}\) are low-dimensional representations of users and movies, respectively

In the past few years, since Deep Neural Networks (DNNs) are extremely good at representation learning, deep learning methods have been widely explored and have shown promising results in various areas such as computer vision and natural language processing [7,8,9]. Xue et al. [6] proposed Deep Matrix Factorization (DMF), which uses a neural network architecture to replace the linear embedding operation used in vanilla matrix factorization. It uses the rows and columns of the user-item rating matrix as high-dimensional vector representations of users and items, and maps them to a low-dimensional space through DNNs. In addition to learning better representations for users and items, DNNs are well suited to learning complex interaction functions because they can approximate any continuous function [10]. NCF [11] was proposed to model user-item interactions with a multi-layer feedforward neural network. It uses the concatenated user ID embedding and item ID embedding as input to a multi-layer perceptron (MLP) for prediction. Exploiting the high capacity and nonlinearity of DNNs, we can learn the complex mapping between user-item representations and predicted scores. Recently, Rendle et al. [12] revisited the experiments in the NCF paper and showed that, under the same experimental settings, a carefully tuned vanilla matrix factorization model can significantly outperform an MLP at modeling the interaction between users and items. Notably, these two lines of work feed the neural network different data; in this paper, we call them explicit data and implicit data, respectively. Although some recent advances [13,14,15] have applied DNNs to recommendation tasks with promising results, they mostly use DNNs to model auxiliary information, such as textual descriptions of items, audio features of music, and visual content of images, in order to learn the representation vectors of users and items; they do not consider the difference between explicit feedback and implicit feedback.

According to the above discussion, it appears feasible to learn representations of users and items by considering explicit data and implicit data together. With this assumption, we propose a collaborative filtering framework that combines the two types of feedback and is optimized with multi-task learning, called Multi-Task learning for Collaborative Filtering (MTCF). Our proposed framework has three optimization tasks. We first use the two types of feedback data to perform vanilla matrix factorization tasks that produce predicted item scores, and then cross the low-dimensional representations of users and items produced by the two matrix factorization tasks. Specifically, we cross the explicit user representation and the implicit user representation to obtain a higher-order integrated user representation, and do the same for items. After obtaining the integrated representations, we use DNNs to learn the complex mapping between the two types of feedback features and the integrated features, and finally obtain a predicted score through user-item interaction. It is worth mentioning that we do not output only the predicted score after the integration crossover; the predicted scores of all three modules contribute to the final predicted score of the main task.

Figure 2 illustrates our key ideas, taking user representation as an example. Representation learning from a single type of feedback data yields only one user representation, whereas our model, which considers both explicit and implicit feedback data, yields at least three. These multiple user representations reflect different facets of the user, which benefits the downstream recommendation task.

Fig. 2 User representation learning with single feedback data versus the user representations of our model

The main contributions of this work are as follows.

  • We propose a new collaborative filtering framework that can utilize explicit feedback information and implicit feedback information at the same time, and mine the cross-features of the two types of feedback information. It is worth mentioning that the model has good generalization performance and can be widely used in most scenarios with explicit feedback and implicit feedback.

  • We use multi-task learning to train our model, which exploits the model's generalization ability, accelerates convergence, and improves effectiveness. To the best of our knowledge, we are the first to use multi-task learning for collaborative filtering that combines implicit and explicit feedback.

  • We perform extensive experiments on 5 real-world datasets to demonstrate the effectiveness and rationality of the proposed MTCF framework.

2 Related work

2.1 Collaborative filtering

In recommender systems, the historical interaction information between users and items is the key to collaborative filtering, and users' explicit feedback directly reflects their preferences. Collaborative filtering with explicit data uses users' direct ratings of or comments on items to perform rating prediction. Singular Value Decomposition (SVD) [16] is an early matrix factorization model, which predicts user ratings of items by decomposing the rating matrix into two small matrices. Later, Lee et al. [17] proposed a non-negative matrix factorization method, which improved the interpretability of the factorization. The success of the Netflix Prize set off a wave of research on recommendation algorithms. Salakhutdinov et al. [18] applied the restricted Boltzmann machine (RBM) to the Netflix dataset with great success, and the model was later extended to item-order rating tasks. Subsequently, Georgiev et al. [19] proposed a hybrid RBM framework that combined user-based and item-based RBMs, using the rating matrix as input to learn the hidden-layer distribution and attempting to reconstruct the rating matrix. More recently, DMF [6] was proposed to replace the linear embedding used in matrix factorization with a two-pathway neural network architecture, together with a new loss function to optimize the model. This direct use of the rating matrix as input exploits only explicit feedback data.

Since most users are reluctant to rate items, explicit feedback data are often difficult to collect. ALS [8] and SVD++ [20] are two early effective models that use implicit feedback for recommendation; both ignore rating values and use binarized implicit feedback. NCF [11] was proposed to replace the inner product of traditional matrix factorization with learned interaction functions, and it interprets matrix factorization as a special case of the NCF framework. NCF comprises two modules, Generalized Matrix Factorization (GMF) and a Multi-Layer Perceptron (MLP), to capture linear and non-linear relationships respectively, and fuses the two with a linear combination to improve performance. Bai et al. [21] proposed the Neighbor-based Neural Collaborative Filtering (NNCF) model, integrating the neighborhood model into neural collaborative filtering for the first time; it improves on NCF by constructing user-item neighborhoods as input. Among deep-learning-based collaborative filtering models, Zhang et al. [18, 22] introduced the attention mechanism into the recommender system to learn the relative weight of each user-item interaction and thereby better capture users' instantaneous interests. However, all of these models resort only to implicit data.

Some studies have found that explicit feedback and implicit feedback are complementary [23,24,25], so applying both at the same time is likely to improve recommendation quality. Bell and Koren [26] approached movie recommendation from this angle, treating rating data as explicit feedback and fusing the two types of feedback for recommendation through factorization and a neighborhood-based model. Pan [27] first clusters the user set and item set with K-means and proposes a factorization machine model based on transfer learning that incorporates explicit and implicit feedback. Li [28] combines the advantages of xCLiMF [25] and SVD++ [20] while fusing explicit and implicit feedback, and proposes Expected Reciprocal Rank (ERR) as a new evaluation method for recommendation quality. Chen et al. [29] apply weighted low-rank processing to implicit feedback data to better leverage its ability to reflect users' hidden preferences. Liu et al. [24] consider the heterogeneity of explicit and implicit feedback (explicit feedback is mostly numerical, while implicit feedback is mostly binary) and, to eliminate the numerical gap, convert both to values between 0 and 1 and assign them different weights. However, this approach weakens the ability of explicit feedback to reflect the degree of a user's preference, and the distinctive characteristics of explicit feedback are not considered. In contrast to the above work, we retain the respective characteristics of explicit and implicit data, mine their deep integrated features through neural networks, and train with an appropriate method. We make full use of implicit feedback to reflect users' hidden preferences and explicit feedback to reflect the degree of their preferences. Table 1 summarizes the pertinent characteristics of implicit and explicit feedback.

Table 1 Characteristics of explicit and implicit feedback

2.2 Multi-task learning

Multi-task learning [30] is an inductive transfer learning method: the main task uses the domain-specific information contained in the training signals of related tasks as an inductive bias to improve its generalization. In recent years, multi-task learning has become increasingly popular, because the success of machine learning and deep learning rests largely on a model's access to good data representations and its ability to mine the required information from data, and multi-task learning can extract more comprehensive and varied information. Features extracted by a single-task model are valid only for that task, and a single feature set cannot describe a sample well. When there are many tasks and the learned features must serve each of them, that is, when the features must be somewhat general, multi-task learning is a better fit. Multi-task learning generally comes in two flavors: one distinguishes a main task from auxiliary tasks, where the auxiliary tasks exist to help train the main task; the other treats multiple tasks as equals, with no major or minor task. We use the former in our work. Choosing appropriate auxiliary tasks is the key to the success of a multi-task learning framework [31].

Fig. 3 General multi-task learning model framework. The shared-bottom network, denoted f, sits at the bottom and is shared by all tasks. Above it, each of the K subtasks has its own tower network, denoted \(h_{K}\), and the output of subtask K is \(y_{K}=h_{K}(f(x))\)

Multi-task learning has many advantages for recommendation tasks [32]. For example, multiple tasks can share part of the network structure, as in Fig. 3, and the learned user and item vector representations can easily be transferred to other tasks. Moreover, the correlation between tasks has a large impact on the effectiveness of multi-task learning; in our work, the auxiliary tasks are parts of our model, so they are highly correlated with the main task. Recently, multi-task learning has been used to solve multiple problems simultaneously in recommender systems. Based on the user's decision-making process, Hadash et al. [33] divided the recommendation task into a ranking task and a rating task and proposed a multi-task framework to train the two jointly; this was the first work to apply multi-task learning to collaborative filtering. Lu et al. [34] proposed a multi-task learning framework that combines probabilistic matrix factorization (PMF) with an adversarial Seq2Seq model: the matrix factorization model predicts user ratings of items, while the Seq2Seq model generates user reviews of items, improving prediction accuracy while partially addressing the difficulty of providing interpretable recommendations. Based on the idea of multi-task learning, Ma et al. [35] proposed a new CVR estimation model, ESMM, which effectively addresses the two key problems of data sparsity and sample selection bias faced by CVR estimation in real scenarios. In the context of Taobao search and recommendation, Ni et al. [36] used a multi-task model to learn general user representations and compared the multi-task model with single-task models experimentally. Zhao et al. [37] used multi-task learning to predict two key outcomes in video recommendation: whether the user will click on a video, and the user's feedback after watching it. These two outcomes can be regarded as an implicit feedback task and an explicit feedback task, respectively. Inspired by this video recommendation work, we take multi-task learning further into collaborative filtering, using the two types of feedback data to construct different collaborative filtering tasks.
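To make the shared-bottom pattern of Fig. 3 concrete, the following is a minimal PyTorch sketch of a network computing \(y_{K}=h_{K}(f(x))\); it is our own illustration rather than the architecture of any cited model, and the layer sizes and task count are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class SharedBottomMTL(nn.Module):
    """Minimal shared-bottom multi-task network: y_k = h_k(f(x))."""

    def __init__(self, in_dim=64, shared_dim=32, tower_dim=16, num_tasks=3):
        super().__init__()
        # f: the shared-bottom network, common to all tasks
        self.shared = nn.Sequential(nn.Linear(in_dim, shared_dim), nn.ReLU())
        # h_k: one small tower per task
        self.towers = nn.ModuleList(
            nn.Sequential(nn.Linear(shared_dim, tower_dim), nn.ReLU(),
                          nn.Linear(tower_dim, 1))
            for _ in range(num_tasks)
        )

    def forward(self, x):
        z = self.shared(x)  # f(x): shared representation
        return [tower(z) for tower in self.towers]  # [h_1(f(x)), ..., h_K(f(x))]

# A batch of 8 inputs yields one output tensor per task
outputs = SharedBottomMTL()(torch.randn(8, 64))
```

Because the towers are small relative to the shared bottom, most parameters receive gradients from every task, which is the knowledge-transfer effect multi-task learning relies on.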

3 Preliminaries

Suppose there are M users \(U= \{ {{\mathrm{u}}_1},...,{{\mathrm{u}}_M}\}\) and N items \(I=\{{{\mathrm{i}}_1},...,{{\mathrm{i}}_N}\}\), where each item can be a book, a movie, or a web page. In the explicit feedback data, let \(R \in \mathbb {R}^{M \times N}\) denote the rating matrix, where \({R_{ui}}\) is the rating of user u on item i.

$$\begin{aligned} {y}_{u i}=\left\{ \begin{array}{l} R_{u i}, \text { if } R_{u i} \text { is observed } \\ 0, \text { otherwise } \end{array}\right. \end{aligned}$$
(1)

In the implicit feedback data, let \(A \in \{0,1\}^{M \times N}\) denote the interaction matrix, where \({A_{{\mathrm{ui}}}} = 1\) records an observed interaction such as a click, favorite, or browse.

$$\begin{aligned} y_{u i}=\left\{ \begin{array}{l} 1, \text { if interaction }(u, i) \text { is observed } \\ 0,\text { otherwise } \end{array}\right. \end{aligned}$$
(2)

In particular, for both types of feedback, \({{\mathrm{y}}_{{\mathrm{ui}}}} = 0\) does not mean that user u dislikes item i; in fact, a system contains so many items that user u may simply never have observed item i. The recommendation problem with explicit feedback is usually formulated as a rating prediction problem that estimates the missing values in the rating matrix R; finally, we select the top-k items to recommend to each user by sorting the predicted scores. Similarly, the recommendation problem with implicit feedback can be formulated as an interaction prediction problem that estimates the missing values in the interaction matrix [38]. It should be emphasized that, to eliminate the difference between the predicted values of the two recommendation problems, we convert both to values between 0 and 1. Model-based approaches [20, 39] assume that there is an underlying model that can generate all ratings as follows.
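As a small illustration of Eqs. (1) and (2), the sketch below builds the explicit target matrix and the binarized implicit target matrix from the same set of (user, item, rating) triples; the toy data are hypothetical.

```python
import numpy as np

M, N = 4, 5  # toy sizes: 4 users, 5 items
triples = [(0, 1, 4.0), (0, 3, 2.0), (2, 0, 5.0), (3, 4, 1.0)]  # (u, i, rating)

y_explicit = np.zeros((M, N))  # Eq. (1): rating if observed, else 0
y_implicit = np.zeros((M, N))  # Eq. (2): 1 if the interaction is observed, else 0
for u, i, r in triples:
    y_explicit[u, i] = r
    y_implicit[u, i] = 1.0
```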

$$\begin{aligned} \hat{y}_{u i}=f(u, i \mid \varTheta ) \end{aligned}$$
(3)

where \({\hat{y}_{ui}}\) denotes the predicted score for the interaction between user u and item i, \(\varTheta\) denotes the model parameters, and f denotes the function that maps the model parameters to the predicted score. The key question is how to define such a function f. Let \({p_u}\) and \({q_i}\) denote the latent representations of u and i, respectively. The Latent Factor Model (LFM) [40] simply applies the dot product of \({p_u}\) and \({q_i}\) to predict \({\hat{y}_{ui}}\) as follows.

$$\begin{aligned} \hat{y}_{u i}=f(u, i \mid \varTheta )=p_{u}^{T} q_{i} =\sum _{k=1}^{K} p_{u k} q_{i k} \end{aligned}$$
(4)

where K denotes the dimension of the latent space, \(K\ll \min (M,N)\). Alternatively, the predicted score can be based on the similarity between the user and the item; for example, we can use the cosine similarity to predict \({\hat{y}_{ui}}\) as follows.

$$\begin{aligned} \hat{y}_{u i}=f(u, i \mid \varTheta )=\mathrm{cosine} \left( p_{u}, q_{i}\right) =\frac{p_{u}^{T} q_{i}}{\left\| p_{u}\right\| \left\| q_{i}\right\| } \end{aligned}$$
(5)

Both the dot product and cosine similarity are used in our work. In matrix factorization with implicit feedback data [6, 11, 12, 40], the dot product is commonly used to compute the similarity between users and items because of its excellent performance. But unlike binary implicit feedback, explicit feedback is mostly numerical, and its value reflects the user's degree of interest. To choose a suitable similarity measure for explicit feedback data, we conducted some preliminary experiments and found that cosine similarity stood out among many similarity measures. We therefore use the dot product for implicit feedback and cosine similarity for explicit feedback. Neural collaborative filtering proposes to learn f automatically with an MLP, motivated by capturing the nonlinear interaction between users and items. We do not follow neural collaborative filtering, because we instead learn the explicit and implicit higher-order features of users and items through a deep representation learning framework to capture users' comprehensive interests.
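For concreteness, a minimal PyTorch sketch of the two scoring functions, Eq. (4) for implicit data and Eq. (5) for explicit data; the batch size and latent dimension here are arbitrary.

```python
import torch
import torch.nn.functional as F

p_u = torch.randn(8, 16)  # a batch of user latent vectors (K = 16)
q_i = torch.randn(8, 16)  # a batch of item latent vectors

dot_score = (p_u * q_i).sum(dim=1)                # Eq. (4): dot product
cos_score = F.cosine_similarity(p_u, q_i, dim=1)  # Eq. (5): cosine similarity
```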

The next question is how to learn the model parameters; most existing work estimates them by optimizing an objective function. Recommender systems are often abstracted as learning to rank or rating prediction, using two types of objective function: pair-wise loss and point-wise loss. In this paper, we predict users' ratings of items based on their explicit feedback. Point-wise loss is widely used in collaborative filtering regression models based on explicit feedback [41]. For our regression objective, we use the most common point-wise loss, the squared loss, learning the parameters by minimizing the squared error between \({y_{ui}}\) and \({\hat{y}_{ui}}\).

$$\begin{aligned} L_{s q r}=\sum _{(u, i) \in y^{+} \cup y^{-}} w_{u i} \left( y_{u i}-\hat{y}_{u i}\right) ^{2}+\lambda \Vert \theta \Vert _{2}^{2} \end{aligned}$$
(6)

where \(y^{+}\) denotes all observed interactions, \(y^{-}\) denotes the sampled unobserved interactions, \(w_{u i}\) denotes the weight of training instance (u, i), \(\theta\) denotes all trainable model parameters, and \(\lambda\) controls the \({L_2}\) regularization strength to prevent overfitting. We adopt Adaptive Moment Estimation (Adam) [42], which adapts the learning rate of each parameter by performing smaller updates for frequent parameters and larger updates for infrequent ones. Adam converges faster than Stochastic Gradient Descent (SGD) and relieves the burden of tuning the learning rate. In summary, our recommendation task is a score prediction problem: the vector representations of users and items interact to produce predicted item scores, and the model is optimized by minimizing the squared loss.
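Below is a minimal sketch of the weighted squared loss in Eq. (6), with the \(\lambda \Vert \theta \Vert _{2}^{2}\) term imposed through Adam's weight_decay option; whether the regularizer is implemented this way in the original code is our assumption.

```python
import torch

def weighted_squared_loss(y_true, y_pred, weights):
    # Eq. (6) without the regularizer: sum over (u, i) of w_ui * (y_ui - y_hat_ui)^2
    return (weights * (y_true - y_pred) ** 2).sum()

# weight_decay plays the role of lambda in Eq. (6)
params = [torch.randn(16, 8, requires_grad=True)]  # stand-in for model parameters
optimizer = torch.optim.Adam(params, lr=5e-4, weight_decay=1e-5)
```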

Fig. 4 The architecture of the Multi-task learning for Collaborative Filtering model. The orange and green parts are the two auxiliary tasks, which use implicit data and explicit data, respectively

4 The proposed framework

Our framework aims to make full use of explicit and implicit feedback data while preserving the generality of the model; it therefore has three tasks: two auxiliary tasks and one main task. Figure 4 illustrates the proposed architecture. The orange and green parts, extracted separately, form our two auxiliary tasks: the collaborative filtering task with implicit data (ICF) and the collaborative filtering task with explicit data (ECF), respectively. The entire framework constitutes the main task of our training.

In particular, to illustrate the conversion of explicit feedback data into implicit feedback data, we take the MovieLens dataset as an example. Movie ratings take the values 1, 2, 3, 4, 5 (observed) or a missing value (unobserved). There are three main ways to convert explicit feedback to implicit feedback:

  (a) \({\text{rating}} \ge 3\): \(r = 1\) (observed, positive sample); otherwise \(r = 0\) (unobserved, negative sample);

  (b) \({\text{rating}} \ne \emptyset\): \(r = 1\) (observed, positive sample); otherwise \(r = 0\) (unobserved; all missing items are treated as negative samples);

  (c) \({\text{rating}} \ne \emptyset\): \(r = 1\) (observed, positive sample); otherwise \(r = 0\) (unobserved; the missing items are sampled and some are selected as negative samples).

Here, \(\emptyset\) means there is no rating. The third method (c) is adopted in our work.
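The sketch below illustrates conversion (c): every observed rating becomes a positive sample, and for each positive a fixed number of that user's unobserved items are drawn as negatives. The function name and the default of four negatives per positive are illustrative assumptions (the ratio is studied in Section 5.5.1).

```python
import random

def binarize_with_sampling(R, num_neg=4, seed=0):
    """Method (c): observed ratings -> r = 1; sample `num_neg` missing
    items per positive interaction as negatives (r = 0)."""
    rng = random.Random(seed)
    samples = []  # (user, item, label) triples
    for u, row in enumerate(R):
        unobserved = [i for i, r in enumerate(row) if r == 0]
        for i, r in enumerate(row):
            if r > 0:
                samples.append((u, i, 1.0))
                for j in rng.sample(unobserved, min(num_neg, len(unobserved))):
                    samples.append((u, j, 0.0))
    return samples
```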

In real scenarios, user behavior often contains different kinds of implicit feedback, such as clicks, favorites, and browsing, and these different signals may reflect user interests; taking them into account may improve recommendation performance to some extent. Owing to the unavailability of data containing multiple kinds of implicit feedback and to limited computing resources, we do not treat different feedback types differently in this paper. In fact, the proposed model is easily extended to exploit different kinds of implicit feedback, and this is a focus of our future work.

4.1 Collaborative filtering task with explicit data and implicit data

In the orange part, we use the one-hot encodings of the user and item IDs as input; let \(V_u^U(V_i^I)\) denote the one-hot encoding of the user (item) ID. The user latent vector \({p_u}\) and item latent vector \({q_i}\) are then given as follows.

$$\begin{aligned} p_{u}&=P^{T} V_{u}^{U} \nonumber \\ q_{i}&=Q^{T} V_{i}^{I} \end{aligned}$$
(7)

We use Eq. (4) to predict item scores. From this, the output of the collaborative filtering task with implicit data (ICF) is defined as follows.

$$\begin{aligned} \hat{y}_{u i}^{I C F}=a_{\mathrm{out}}\left( p_{u}^{T} q_{i}\right) \end{aligned}$$
(8)

where \(a_{\mathrm{out}}\) denotes the activation function; we use \(\text { sigmoid }(x)=1 /\left( 1+e^{-x}\right)\) to map the output value into [0,1].
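A possible PyTorch rendering of this branch: since multiplying a one-hot vector by \(P^{T}\) in Eq. (7) is just a row lookup, the branch reduces to embedding layers followed by a dot product and sigmoid, as in Eq. (8). This is a sketch under those assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class ICF(nn.Module):
    def __init__(self, num_users, num_items, k=16):
        super().__init__()
        # Eq. (7): P^T V_u and Q^T V_i reduce to embedding lookups
        self.P = nn.Embedding(num_users, k)
        self.Q = nn.Embedding(num_items, k)

    def forward(self, u, i):  # u, i: LongTensors of user/item indices
        p_u, q_i = self.P(u), self.Q(i)
        score = torch.sigmoid((p_u * q_i).sum(dim=1))  # Eq. (8)
        return score, p_u, q_i  # latent vectors are reused in the crossing step
```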

In the green part, we use the rating matrix \({y^{M \times N}}\) from Eq. (1) as input, where each row \({y_{u*}}\) and each column \({y_{*i}}\) of y represent a user and an item, respectively. Since the initial inputs are high-dimensional vectors of two different dimensions, we must map them to a low-dimensional space of the same dimension to facilitate the subsequent operations. We simply use a linear regression function for this mapping, and the latent vectors of users and items are defined as follows.

$$\begin{aligned} p_{u}&=y_{u *} W_{u}+b_{u} \nonumber \\ q_{i}&=y_{* i} W_{i}+b_{i} \end{aligned}$$
(9)

where \({W_*}\) and \({b_*}\) denote the weight matrix and bias vector, respectively. Here we use Eq. (5) to predict item scores. From this, the output of the collaborative filtering task with explicit data (ECF) is defined as follows.

$$\begin{aligned} \hat{y}_{u i}^{E C F}=a_{\mathrm{out}}\left( \mathrm{cosine}\left( p_{u}, q_{i}\right) \right) \end{aligned}$$
(10)

Similarly, \(a_{\mathrm{out}}\) denotes the activation function, and we again use \(\text { sigmoid }(x)=1 /\left( 1+e^{-x}\right)\) to map the output value into [0,1].
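Analogously, a sketch of this branch, Eqs. (9)-(10): linear projections of the rating-matrix rows and columns, scored with cosine similarity. The layer shapes follow the definitions above; the module name is our own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ECF(nn.Module):
    def __init__(self, num_users, num_items, k=16):
        super().__init__()
        # Eq. (9): linear maps from rating rows (length N) and columns (length M)
        self.user_proj = nn.Linear(num_items, k)
        self.item_proj = nn.Linear(num_users, k)

    def forward(self, y_row, y_col):  # y_row: (B, N), y_col: (B, M)
        p_u = self.user_proj(y_row)
        q_i = self.item_proj(y_col)
        score = torch.sigmoid(F.cosine_similarity(p_u, q_i, dim=1))  # Eq. (10)
        return score, p_u, q_i
```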

In the previous two parts, we obtained user and item representations under the two types of feedback data. We then cross the two representations to obtain a more complex representation, using the element-wise product (\(\odot\)) to build the cross-features. The latent vectors of the new users and items are given as follows.

$$\begin{aligned}&p_{u}=p_{u}^{I C F} \odot p_{u}^{E C F} \nonumber \\&q_{i}=q_{i}^{I C F} \odot q_{i}^{E C F} \end{aligned}$$
(11)

Next, we use an MLP to further learn the comprehensive latent representations of users and items. The user's representation learning part is defined as follows.

$$\begin{aligned}&a_{0}=W_{0}^{T} p_{u} \nonumber \\&a_{1}=a\left( W_{1}^{T} a_{0}+b_{1}\right) \nonumber \\&\qquad \cdots \cdots \nonumber \\&p_{u}=a_{x}=a\left( W_{x}^{T} a_{x-1}+b_{x}\right) \end{aligned}$$
(12)

where \({W_x}\), \({b_x}\), and \({a_x}\) denote the weight matrix, bias vector, and activation function of the x-th perceptron layer, respectively. In this paper, we use the Rectified Linear Unit (ReLU) as the activation function. The comprehensive latent representation of the item, \({q_i}\), is obtained in the same way. Finally, we again use cosine similarity to predict item scores, so the IECF output is defined as follows.

$$\begin{aligned} \hat{y}_{u i}^{I E C F}=\mathrm{cosine}\left( p_{u}, q_{i}\right) \end{aligned}$$
(13)

Finally, we can get the final output of the main task as follows.

$$\begin{aligned} \hat{y}_{u i}^{\mathrm{main}}=\mathrm{sigmoid}\left( \hat{y}_{u i}^{I C F} +\hat{y}_{u i}^{I E C F}+\hat{y}_{u i}^{E C F}\right) \end{aligned}$$
(14)
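Putting the pieces together, a hedged sketch of Eqs. (11)-(14): the ICF and ECF latent vectors are crossed element-wise, refined by a per-side MLP, scored with cosine similarity, and the three branch scores are summed and squashed. The hidden sizes and the use of ReLU after every layer are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IECF(nn.Module):
    def __init__(self, k=16, hidden=32):
        super().__init__()
        # Eq. (12): an MLP per side; two layers with ReLU (sizes assumed)
        self.user_mlp = nn.Sequential(nn.Linear(k, hidden), nn.ReLU(),
                                      nn.Linear(hidden, k), nn.ReLU())
        self.item_mlp = nn.Sequential(nn.Linear(k, hidden), nn.ReLU(),
                                      nn.Linear(hidden, k), nn.ReLU())

    def forward(self, p_icf, p_ecf, q_icf, q_ecf):
        p_u = self.user_mlp(p_icf * p_ecf)  # Eq. (11): element-wise cross
        q_i = self.item_mlp(q_icf * q_ecf)
        return F.cosine_similarity(p_u, q_i, dim=1)  # Eq. (13)

def main_score(y_icf, y_iecf, y_ecf):
    return torch.sigmoid(y_icf + y_iecf + y_ecf)  # Eq. (14)
```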

4.2 Multi-tasks

To effectively learn parameters for recommendation while preserving the generalization ability of the framework, we use ICF and ECF as independent auxiliary tasks and the entire model as the main task. Compared with the main task, the two auxiliary tasks are relatively simple, so our multi-task learning strategy combines simple tasks with a complex one. The three tasks are trained on the same training set with all model parameters shared, which achieves knowledge transfer and accelerates convergence. Each of the three tasks is optimized with the squared error loss of Eq. (6).

$$\begin{aligned} L_{\mathrm{main}}&=\sum _{(u, i) \in y^{+} \cup y^{-}} w_{u i} \left( y_{u i}-\hat{y}_{u i}^{\mathrm{main}}\right) ^{2} \nonumber \\ L_{I C F}&=\sum _{(u, i) \in y^{+} \cup y^{-}} w_{u i} \left( y_{u i}-\hat{y}_{u i}^{I C F}\right) ^{2} \nonumber \\ L_{E C F}&=\sum _{(u, i) \in y^{+} \cup y^{-}} w_{u i} \left( y_{u i}-\hat{y}_{u i}^{E C F}\right) ^{2} \end{aligned}$$
(15)

For convenience of computation, we map the output values \(\hat{y}_{u i}\) into [0,1]; hence all \(y_{u i}\) here come from Eq. (2).

Generally, there are two training regimes for multi-task learning: alternating training and joint training. Alternating training means that during iterative training we alternately optimize the loss of each task; joint training means that in each epoch we train all tasks and combine their losses. Wang et al. [43] also use multi-task learning when completing recommendation tasks with knowledge graphs, alternately training the recommendation task and the knowledge graph embedding task; as can be seen from their code, out of every 5 epochs, 4 train the recommendation task and the remaining one trains the knowledge graph embedding task. Xin et al. [44] divide the recommendation task into a user-item preference modeling task and an item-item relationship modeling task and train the two jointly under a single overall objective. In our work, we adopt the latter regime.

The next key question is how to combine the multiple losses. The first method we tried was simply summing the different losses, but we soon found that the losses of different tasks have very different scales, so the overall loss is dominated by one task and the losses of the other tasks cannot influence the learning of the shared layers. Moreover, when the loss of the main task is already very small, we do not want the auxiliary tasks to change the model parameters significantly, so we design the total objective as follows.

$$\begin{aligned} \min _{\theta } L=L_{\mathrm{main}}+L_{\mathrm{main}}\left( \alpha L_{I C F} +\beta L_{E C F}\right) \end{aligned}$$
(16)
Fig. 5 The losses of the two auxiliary tasks when trained alone

where \(\alpha\) and \(\beta\) are the relative weights of the auxiliary tasks. To choose \(\alpha\) and \(\beta\), we conducted some preliminary experiments. As Fig. 5 shows for the ML100K dataset, the ECF loss stops decreasing soon after training begins, while the ICF loss continues to decline; ECF thus converges faster than ICF, and we want ICF to be fully trained. When ICF is converging quickly, the two auxiliary losses differ by roughly a factor of 4, so in Eq. (16) we increase the relative weight of the ICF loss. In our work, for the ML100K dataset, we simply set \(\alpha =4\) and \(\beta =1\); we will explore the relationship between the auxiliary tasks more deeply in the future. The training procedure for MTCF is given in Algorithm 1.

Algorithm 1 The training procedure for MTCF
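As a complement to Algorithm 1, a one-line sketch of the joint objective in Eq. (16); scaling the auxiliary terms by \(L_{\mathrm{main}}\) damps their gradients once the main task has converged, as discussed above, and the defaults \(\alpha =4\), \(\beta =1\) follow the ML100K setting.

```python
def mtcf_loss(l_main, l_icf, l_ecf, alpha=4.0, beta=1.0):
    # Eq. (16): auxiliary losses are modulated by the main loss, so their
    # influence shrinks as the main task converges
    return l_main + l_main * (alpha * l_icf + beta * l_ecf)
```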

5 Experiments

In this section, we demonstrate the effectiveness of our proposed framework through experiments and perform a series of extensive comparisons under different experimental settings, such as the number of negative samples and the number of network layers.

5.1 Experimental settings

5.1.1 Datasets

We evaluate our proposed framework on five benchmark datasets: MovieLens 100K (ML100K), MovieLens 1M (ML1M), LastFM, Amazon music (AMusic), and Amazon toy (AToy). The MovieLens datasets have been preprocessed by their provider: each user has at least 20 ratings and each item has been rated by at least 5 users. For the LastFM dataset, we do not filter any users or ratings and use the published version directly. For the other three datasets, we apply the same filtering as for MovieLens. The statistics of the five datasets are summarized in Table 2.

Table 2 Statistics of the evaluation datasets

5.1.2 Evaluation for recommendation

To evaluate the performance of item recommendation, we adopt the leave-one-out evaluation widely used in the literature [5, 11, 12, 45]: the latest interaction of each user is held out for testing and the rest of the dataset is used for training. Since ranking all items is time-consuming, we randomly sample 99 unobserved interactions for each user and rank the 100 items according to the predictions. The performance of a ranked list is measured by Hit Ratio (HR) and Normalized Discounted Cumulative Gain (NDCG) [46]; in our experiments, both metrics truncate the ranked list at 10. Intuitively, HR measures whether the test item is present in the top-10 list, while NDCG measures ranking quality by assigning higher scores to hits at top ranks.
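For reference, a sketch of the two metrics under this protocol: given the 1-based rank of the held-out test item among the 100 candidates, HR@10 and NDCG@10 for a list with a single relevant item follow the standard definitions [46].

```python
import math

def hr_at_k(rank, k=10):
    # 1 if the test item appears in the top-k list, else 0
    return 1.0 if rank <= k else 0.0

def ndcg_at_k(rank, k=10):
    # With a single relevant item, DCG = 1 / log2(rank + 1) and IDCG = 1
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0
```

Per-user values are averaged over all users to obtain the reported HR@10 and NDCG@10.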

5.1.3 Adapting to temporal changes

A key assumption of most machine learning models is that the inputs are independent and identically distributed. This does not strictly hold in recommendation, since users' behavioral preferences change over time and recent behavior better represents a user's current preferences; moreover, the recommender's task is to predict the user's next click. Therefore [47], learning a recommendation model on the entire dataset may hurt performance, because the model ends up focusing on out-of-date properties. One remedy is to discard early data, but that reduces the amount of training data. We propose a simple way to get the best of both worlds via pre-training: we first pre-train a model on the entire dataset and then continue training it on only a subset of recent data, e.g., the last week's worth of data out of a month of interactions.
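Schematically, the strategy is a two-stage loop; `train_one_epoch` and the data split are hypothetical placeholders, since the concrete trainer depends on the model at hand.

```python
def pretrain_then_finetune(model, all_data, recent_data, train_one_epoch,
                           pretrain_epochs=20, finetune_epochs=10):
    # Stage 1: pre-train on the full interaction history
    for _ in range(pretrain_epochs):
        train_one_epoch(model, all_data)
    # Stage 2: continue training on recent interactions only
    # (e.g., the last week out of a month of data)
    for _ in range(finetune_epochs):
        train_one_epoch(model, recent_data)
    return model
```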

5.2 Performance comparison

We compare our proposed MTCF method with the following methods. Since the proposed model focuses on modeling the relationship between users and items, we mainly compare with user-item models.

  • ItemPop [19] is a non-personalized method that is often used as a benchmark for recommendation tasks. It ranks items by their popularity, measured by the number of interactions.

  • ItemKNN [3] is the standard item-based collaborative filtering method.

  • eALS [5] is a state-of-the-art MF method for recommendation with square loss. It uses all unobserved interactions as negative instances and weights them non-uniformly by item popularity.

  • MF is the standard matrix factorization that models the user preference with the inner product between user and item embeddings.

  • NeuMF [11] is a state-of-the-art neural collaborative filtering method that fuses Generalized Matrix Factorization with an MLP to learn the user-item interaction function from implicit feedback.

  • DMF [6] is a state-of-the-art representation-learning-based MF method that performs deep matrix factorization on the rating matrix with a normalized cross-entropy loss.

  • DeepCF [38] is an improved algorithm based on representation learning and matching learning, which combines the advantages of the two models and uses a high-dimensional vector of user-item explicit feedback as input for users and items.

  • EIFCF [29] is a collaborative filtering algorithm that integrates implicit feedback and explicit feedback in stages.

We implement our proposed method with the PyTorch framework. All MTCF tasks are learned by optimizing the squared loss of Eq. (6), using a two-layer neural network throughout. For the neural networks, we randomly initialize the trainable parameters from a Gaussian distribution (mean 0, standard deviation 0.01) and optimize the model with mini-batch Adam. We set the batch size to 512 and search the learning rate over [0.0005, 0.0001, 0.00005].
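The stated initialization could be applied in PyTorch as below; attaching it to linear and embedding layers via `apply` is our assumption about the implementation.

```python
import torch.nn as nn

def init_weights(module):
    # Gaussian initialization: mean 0, standard deviation 0.01, as stated above
    if isinstance(module, (nn.Linear, nn.Embedding)):
        nn.init.normal_(module.weight, mean=0.0, std=0.01)

# Usage: model.apply(init_weights)
```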

The comparison results are summarized in Table 3, with the best and second-best scores shown in bold. From the table, we make the following key observations:

  • Our results confirm the feasibility of neural networks for the recommendation problem. Although MF performs better than state-of-the-art representation-learning-based methods on high-density (\(density > 1\%\)) datasets, the latter are better on sparse datasets. Our proposed architecture, however, achieves the best performance on all datasets in both NDCG and HR, compared to the other methods.

  • In particular, on sparse datasets, our model obtains 21.3-24.4% (22.9% on average) and 21.3-22.5% (21.9% on average) relative improvements in NDCG and HR over state-of-the-art representation-learning-based methods. Even compared to MF on high-density datasets, our model still obtains 0.7-1.5% (1.1% on average) and 0.4-2.3% (1.4% on average) relative improvements in NDCG and HR, respectively.

  • Regarding the two types of feedback, our method is significantly better than DMF, which uses only explicit feedback data, and also better than NeuMF, DeepCF, and other methods that rely only on implicit feedback. Moreover, compared with EIFCF, which combines explicit and implicit feedback in stages, our multi-task learning strategy yields better results.

Table 3 Comparison results of different methods in terms of HR@10 and NDCG@10

5.3 Impact of pre-training

In our work, we adapt to temporal change via pre-training, as described earlier: we first pre-train a model on the entire dataset and then continue training it on a subset of recent data. To show that this training strategy is effective not only for our model but also for other methods, we applied the pre-training procedure to the methods above; the results are shown in Table 4. Two points deserve note: among the non-deep models we keep only the strongest, MF, and improvements are shown in bold. From Table 4, we observe that most methods improve on most datasets. On closer inspection, the improvement is more pronounced on the high-density datasets and smaller on the sparse ones, where performance may even degrade due to insufficient training data. Overall, however, our method outperforms every baseline, especially on the MovieLens datasets, and shows a small improvement even on the sparse datasets.

Table 4 Comparison results with pre-training of different methods in terms of HR@10 and NDCG@10

5.4 Research on the effectiveness of multi-task learning

5.4.1 Time efficiency

Fig. 6 Efficiency comparison of MTCF against baseline and state-of-the-art models, in average runtime per epoch on the ML1M dataset

Generally, traditional neural networks with iterative training mechanisms can make training the entire model time-consuming [48,49,50], so we examine the training time of the models. Figure 6 compares the running times of MTCF, the baselines, and the state-of-the-art models on ML1M. Note that we use the leave-one-out test protocol, so test times differ little across models. The vertical axis shows the average running time per epoch, with identical hardware settings for all models. The figure shows that MTCF has the longest training time per epoch. However, each training step of our model optimizes three tasks, so the overhead relative to the single-task neural models is modest. Compared with the non-neural models MF and EIFCF, the extra time of our model is mainly spent constructing the multiple user and item representations.

5.4.2 Ablation study of multi-task learning

To explore the effectiveness of multi-task learning, we removed the two auxiliary tasks and performed an ablation experiment on the ML1M dataset. The results are shown in Fig. 7. Initially, the multi-task model performs worse than the single-task model because of the noise introduced by the extra tasks, but after about 30 epochs the multi-task model overtakes the single-task model.

Fig. 7 The effect of the multi-task learning strategy on the ML1M dataset

5.5 Sensitivity to hyper-parameters

Although pre-training with the temporal strategy yields higher performance on high-density datasets, the gains are less obvious on sparse datasets and performance may even decline. Therefore, to make the comparison fairer and to probe the model's own sensitivity to hyper-parameters, the experiments in this section do not use pre-training.

5.5.1 Negative sampling ratio

Compared with pair-wise objective functions, one advantage of point-wise loss is the flexible sampling ratio for negative instances [51, 52]. To illustrate the impact of negative sampling on the MTCF method, we test different negative sampling ratios, i.e., the number of negative samples per positive instance, on all datasets. From the results in Fig. 8, we find that the optimal number of sampled negatives is related to the sparsity of the dataset: the sparser the dataset, the fewer negatives are needed. For the ML1M dataset, the optimal negative sampling ratio is around 3 to 5, consistent with previous work [11]. Sampling more negative instances not only takes more training time but also degrades model performance.

Fig. 8 The effect of the negative sampling ratio on performance

Fig. 9 Performance with different numbers of hidden layers

5.5.2 Depth of layers in network

In our proposed model, we cross the explicit and implicit feedback data and learn their joint complex features through a neural network with multiple hidden layers. We conduct an extensive experiment on the datasets to investigate the model with different numbers of hidden layers. From the results in Fig. 9, the neural network indeed achieves good performance compared with the zero-layer variant, and this is particularly evident on the Amazon datasets (AMusic, AToy). On the AToy dataset, deeper is better; on the AMusic dataset the trend is less stable, and performance starts to decrease beyond three layers. On the two MovieLens datasets, our model performs best with two layers, similar to the results of previous work [6, 11]. On the LastFM dataset, adding layers brings less obvious but still noticeable gains.

6 Conclusion

In this work, we propose a new collaborative filtering framework. We are the first to mine the relationship between users and items from a multi-task perspective by combining users' explicit and implicit feedback data. In the proposed framework, we make full use of both explicit ratings and implicit feedback in two ways and design a neural network that crosses the explicit and implicit representations. We conducted extensive experiments on five real-world datasets and demonstrated the effectiveness of our MTCF method. This work also responds to recent questions about using deep neural networks for recommendation tasks.

As for future work, we will look for datasets in which explicit and implicit feedback differ substantially, to conduct further experiments and improve our framework. Because users' behavioral preferences change over time, simply using the last week of interactions to reflect a user's recent behavior, as we do here, is not comprehensive; we will therefore study time-aware models to capture the evolution of user preferences.