1 Introduction

With the rapid development of Internet technology, the amount of information on the Internet is growing exponentially, resulting in a serious problem of information overload. To overcome this problem, recommendation systems (RS) have emerged, which help users find the information that meets their needs [22]. RS can not only filter irrelevant information according to users’ preferences, but also provide personalized recommendations for users. Among existing techniques for personalized recommendation [3, 16], collaborative filtering (CF) [18] is one of the most widely used. It has been deployed on many commercial websites, such as Amazon, YouTube, Netflix, Taobao, Douban and Last.fm. CF-based methods exploit the available user-item rating data to make predictions about users’ preferences, and they can be divided into two major categories [41]: memory-based and model-based. Specifically, the former [14] makes predictions based on the similarities between users and items, while the latter [25] builds a prediction model using machine learning.

Although CF has achieved huge success over the past two decades, some important issues in RS still remain unsolved, as follows:

  • Sparsity Problem: In the real world, most users rate or experience only a small fraction of the available products. As a result, the density of available ratings in recommendation systems is often less than 1% [17]. Owing to this data sparsity, CF methods have difficulty identifying related users in the system, so the predictive quality of the recommendation system may be significantly restricted.

  • Cold-Start Problem: This remains a universal challenge for recommendation systems. Here, cold-start refers to users who have expressed few or no ratings, or items that have been rated by few or no users. Owing to the lack of sufficient rating data, similarity-based approaches fail to find the nearest-neighbor users or items, which in turn degrades the consistency of recommendations produced by conventional algorithms.

Furthermore, traditional CF approaches mine only the user-item rating matrix for recommendations, which cannot provide sufficiently accurate and reliable predictions. In daily life, however, people usually prefer to consult their trusted friends’ preferences when making decisions, rather than the mass population. To this end, Qian et al. [31] present a social recommendation framework using global rating reputation and local rating similarity (SoRS). Li et al. [21] propose a social matrix factorization method that optimizes the prediction in both the users’ latent feature space and the user-item rating space using individual trust among users. Wang et al. [44] propose a social-enhanced content-aware recommendation method that fuses the social network with item and user reviews. Although these CF approaches incorporating social trust relationships are effective for industrial recommendation, their performance may be limited by the linearity of the underlying matrix factorization models.

Recently, deep learning has made breakthroughs in many domains, such as image processing [28, 42], natural language understanding [27] and speech recognition [10], bringing new opportunities for research on recommendation systems. Compared with traditional machine learning methods [34, 35], deep learning not only has a strong ability to learn the essential characteristics of data sets from samples, but can also obtain deep-level feature representations of users and items. Moreover, deep learning uses automatic feature learning from multi-source heterogeneous data to map differently structured data into the same hidden space, yielding a unified representation of the data. Because of these benefits, how to utilize deep learning to improve recommendation performance has attracted much attention. For example, Wei et al. [45] propose integrated recommendation models combining CF and deep learning to solve the complete cold-start problem (IRCD-CCS). Fu et al. [11] propose a deep learning method that imitates effective intelligent recommendation by understanding users and items beforehand, overcoming the limitation that CF-based methods grasp only a single type of relation. Dau et al. [6] propose a recommendation system that utilizes aspect-based opinion mining (ABOM) based on deep learning to improve the accuracy of the recommendation process. Shamshoddin et al. [38] propose a deep learning with collaborative filtering technique to predict user preferences from Internet of Things (IoT) devices and social networks.

In general, deep learning needs massive data to train a robust and accurate model with many parameters, which demands heavy computing resources. Moreover, in recommendation systems, user feedback data are usually sparse and discrete, because each user typically rates or experiences only a small fraction of the available items. If these sparse and discrete data are used directly to train the model, the performance of deep neural networks for recommendation may be severely restricted. Therefore, it is necessary to develop a method that can both reduce computing resources and mine these sparse and discrete data in recommendation systems.

To this end, this paper proposes an Enhanced Autoencoder Framework, named EAF-SR, to learn robust knowledge from the sparse and discrete user feedback data. The main contributions of this paper are summarized as follows:

  • An Enhanced Autoencoder Framework (EAF-SR) is proposed for social recommendation by using the technique of knowledge distillation.

  • To better learn robust information from user feedback data, an autoencoder is employed to map differently structured data into the same hidden space, yielding a unified representation of the data.

  • To alleviate the problem of data sparseness, a stacked denoising autoencoder is proposed to generate soft targets, thereby learning the hidden information in users’ social information.

  • To reduce computing resources, a knowledge distillation-based recommendation method is proposed, enabling the system to run at low cost.

  • To provide robust recommendations to users, the pre-training and re-training networks are combined to make predictions.

The remainder of this paper is organized as follows. Section 2 reviews previous work on recommendation systems. Section 3 details the proposed EAF-SR. Section 4 presents a series of experiments that illustrate the performance of the proposed EAF-SR. Section 5 draws conclusions.

2 Related works

In this section, we briefly review related work on social recommendation, deep learning for recommendation systems, and knowledge distillation.

2.1 Social recommendation

Social scientists have long believed that a user’s preferences are similar to those of her/his social connections, following the social correlation theories of homophily and social influence [48]. With the increasing popularity of online social networking platforms, social network information has become an effective data source for alleviating the data sparsity problem and enhancing recommendation performance [40]. Since latent factor-based models perform better than neighborhood-based methods in CF, a popular direction is to design more sophisticated latent factor-based social recommendation models. For example, Ma et al. [26] propose a latent factor-based framework with social regularization for recommendation. SocialMF incorporates social influence theory into classical latent factor-based models [15]. By treating the preferences of each user’s social connections for an item as auxiliary feedback for that user, researchers proposed a trust-based latent factor model that leverages this auxiliary feedback [12].

Based on the observation that users tend to rank items their friends prefer more highly, a personalized ranking-based social recommendation model was proposed that extends the classical BPR model [54]. Researchers have also argued that both positive and negative links in social networks provide valuable clues for recommendation performance [9]. These social recommendation algorithms have been extended to incorporate rich context information, such as social circles [30] and item content [55]. Since the performance of latent factor-based social recommendation models relies on the initialization of user and item latent factors, researchers proposed applying the autoencoder, an unsupervised deep learning technique, to the initialization [7]. These models showed improved performance over classical recommendation models. Nevertheless, few works have explored designing deep learning-based social recommendation models. Recently, neural social collaborative ranking was proposed to bridge the few overlapping users between the social network domain and the information domain [43]. Researchers have also used deep learning models to model social influence strength for temporal social recommendation [52]. This paper differs from these works in that it focuses on learning robust information from the user social network to generate soft labels.

2.2 Deep learning for recommendation systems

Recently, deep learning has achieved great success in many fields owing to its strong learning ability, including object detection [49], speech recognition [46] and natural language processing [24]. Motivated by this, many works have applied it to improve the performance of recommendation systems. In particular, Salakhutdinov et al. [36] propose the Restricted Boltzmann Machine (RBM) model, which maps the original input data from the visible layer to the hidden layer and remaps the obtained hidden vector back to the visible layer to obtain item ratings. Sedhain et al. [37] develop AutoRec (Autoencoders Meet Collaborative Filtering), which reconstructs the missing part of the rating matrix and applies the reconstructed data to recommend products. Anil et al. [1] propose an LSTM-GRU-Hybrid method that combines deep neural architectures with collaborative filtering to provide effective recommendations.

In short, deep learning needs massive data to train deep neural networks. However, the data used in recommendation systems suffer from serious sparsity, which makes it difficult to take full advantage of deep neural networks in this setting.

In most cases, observed and unobserved user-item pairs are treated as ones and zeros in the recommendation system. However, such a treatment cannot fully reflect real life: it is difficult to accurately describe users’ preferences using only ones and zeros, which makes it hard to model users with these noisy data. Based on this fact, and in order to improve recommendation accuracy, a pre-training network is proposed to reduce the noise.

2.3 Knowledge distillation

Recently, Hinton et al. [13] showed that the soft targets produced by a generation (teacher) network contain abundant information, which can be used to train another network to competitive performance. Motivated by this idea, Cui et al. [5] propose a Multi-View Recurrent Neural Network (MV-RNN) model that uses the soft targets of separate structures to handle multi-view features. Dighe et al. [8] utilize soft targets to improve the mapping of far-field acoustics to close-talk senone classes. Zhao et al. [56] propose a collaborative teaching knowledge distillation (CTKD) strategy that uses the valuable information produced during training, together with the training results, to improve the performance of the student network. However, in practice, the soft targets of the pre-training network also contain a lot of noise and do not perfectly fit the input data. Therefore, the pre-training and re-training phases are incorporated into a unified framework. In this way, the proposed model can propagate the training errors of the distillation process to tune the soft targets and reduce their noise, improving performance. Moreover, a novel distillation layer is proposed to adjust each unit of the generated vectors, balancing the effects of knowledge and noise according to the corresponding reliability. Finally, unlike previous works that make predictions solely from the distillation network, this paper makes final recommendations by integrating the results of both the generation and distillation subnetworks.

3 An enhanced autoencoder framework for social recommendation

In this section, an enhanced autoencoder framework, named EAF-SR, is proposed for social recommendation. It consists of three parts (see Fig. 1): Pre-training, a distilled knowledge layer, and Re-training. Specifically, Pre-training is designed to generate soft targets using an SDAE [2]. Next, distillation layers are proposed to balance the knowledge and noise in the outputs of Pre-training. After that, Re-training is developed to learn implicit knowledge from the soft targets based on an autoencoder structure. Finally, the Pre-training and Re-training networks are combined through the distilled knowledge to make robust recommendations for users.

Fig. 1: The entire structure of the proposed EAF-SR, which consists of three modules: 1) Pre-training for generating soft targets with SDAE; 2) Distilled knowledge layer; 3) Re-training for learning implicit information from soft targets

3.1 Notation and motivations

3.1.1 Notation

The notation used in this paper is introduced first, as shown in Table 1.

Table 1 List of Key Symbols

3.1.2 Motivations

For top-N recommendation, observed and unobserved user-item pairs are regarded as ones and zeros, respectively. However, such hard targets cannot exactly reflect users’ true interest in items. In fact, users may prefer some of the unobserved items over observed ones. As a consequence, it is not easy to model users exactly by directly utilizing these discrete hard targets of user feedback.

For example, as demonstrated in Fig. 2, two matrices represent user feedback on items in terms of hard and soft targets, respectively. Suppose there are three users {u1, u2, u3} and four items {v1, v2, v3, v4} in the recommender system. Figure 2a demonstrates the observed user-item matrix with hard targets, and Fig. 2b shows a possible user-item matrix with soft targets reflecting latent user preferences. Specifically, in Fig. 2a, the preference of user u2 is closer to user u1 than to u3, whereas in Fig. 2b, user u2 is closer to user u3 than to u1. This example shows that directly learning user preferences from hard targets may lead to incorrect results.

Fig. 2: An example of two user-item matrices in recommender systems: (a) a user-item matrix with hard targets, where “1” denotes observed data and “?” denotes others; (b) a user-item matrix with soft targets showing latent user preferences

To address this problem, we propose to transform the hard targets with discrete values into soft targets with continuous values via a pretrained model, and then utilize another network to learn user preferences from the generated soft targets. However, this technique of knowledge distillation is not a free lunch. Since the soft targets are generated by an imperfect model, the generated data contain a lot of noise, which adversely affects the knowledge distillation task. Therefore, developing a robust model to address this problem remains a challenge.

In this paper, we propose a novel Enhanced Autoencoder Framework for Social Recommendation (EAF-SR) with knowledge distillation to learn robust information from soft targets. As demonstrated in Fig. 1, the overall structure of EAF-SR consists of three components: a Pre-training network, a distillation layer and a Re-training network. Specifically, the soft targets are generated by an SDAE network, into whose hidden layer a user node is injected to model users’ interests exactly. Following this network, we propose a novel distillation layer that adjusts the generated targets to retain useful knowledge and reduce noise for re-training, based on the reliability of each unit. Finally, we use an autoencoder network to learn implicit information from these soft targets for each user.

To learn the soft targets, we incorporate the generation and re-training stages into a unified framework. As demonstrated in Fig. 1, the soft targets in the EAF-SR model are constrained by the hard targets from three perspectives: they are generated from hard targets, they are close to hard targets, and they can be used to reconstruct hard targets. As a result, the soft targets contain much less noise than the hard targets, making it easier to learn useful information from them.

In particular, there is a critical question about this idea. From the perspective of information theory, the generated soft targets contain no more useful information than the original data. How, then, can the EAF-SR model mine more information from soft targets than from hard targets? In fact, this approach does not add any information about users. As demonstrated in Fig. 1, the method instead changes the perspective for understanding users to the soft targets, which contain much less noise than the hard targets. Therefore, by understanding users from a different perspective, the method can more easily learn robust knowledge from user feedback. A small numeric illustration follows.
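To make this concrete, the following minimal Python sketch reproduces the Fig. 2 scenario; both matrices are hypothetical, with values chosen only to mirror the relations described in the text, not the output of any trained model:

```python
import numpy as np

# Hypothetical hard targets in the spirit of Fig. 2a: 1 = observed,
# 0 stands in for "?" (unobserved). Rows: users u1..u3; columns: items v1..v4.
hard = np.array([[1, 1, 0, 0],
                 [1, 1, 0, 0],
                 [0, 1, 1, 1]], dtype=float)

# Hypothetical soft targets (continuous latent preferences, as in Fig. 2b).
soft = np.array([[0.9, 0.8, 0.1, 0.2],
                 [0.6, 0.7, 0.6, 0.7],
                 [0.2, 0.8, 0.9, 0.9]], dtype=float)

def cos(a, b):
    """Cosine similarity between two user rows."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Under hard targets u2 looks closest to u1; under soft targets u2 is
# closer to u3, matching the example described in the text.
print(cos(hard[1], hard[0]), cos(hard[1], hard[2]))  # 1.00 vs 0.41
print(cos(soft[1], soft[0]), cos(soft[1], soft[2]))  # 0.81 vs 0.94
```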

3.2 Pre-training for generating soft targets with SDAE

Recently, great achievements have been made in applying deep learning to recommendation systems. For example, Li et al. [20] propose a general deep architecture for collaborative filtering that integrates matrix factorization with deep feature learning. Wu et al. [47] present the Collaborative Denoising Auto-Encoder (CDAE) for top-N recommendation, which utilizes the idea of denoising autoencoders. Zhang et al. [51] propose a hybrid model that generalizes the contractive autoencoder paradigm into a matrix factorization framework with good scalability and computational efficiency. Li et al. [19] propose a Bayesian generative model called collaborative variational autoencoder (CVAE) that considers both ratings and content for recommendation in a multimedia scenario. Liang et al. [23] extend variational autoencoders (VAEs) to collaborative filtering for implicit feedback. From these studies, it can be seen that most approaches are based on denoising autoencoder (DAE) technology, which has been proven to improve the performance of recommendation systems. Therefore, in this subsection, a stacked denoising autoencoder (SDAE) based pre-training method (see Fig. 3) is proposed to generate soft targets.

Fig. 3: Pre-training based on SDAE for recommendation with user side information, which includes two parts: (a) user-item ratings; (b) social trust information

Suppose the user set U = {u1,u2,u3,⋯ ,un}, the item set V = {v1,v2,v3,⋯ ,vm} and the rating matrix \(R\in \mathbb{R}^{n\times m}\). Herein, \(R_{nm}\) is the rating given to item m by user n: if user n has rated item m, then \(R_{nm}\neq 0\); otherwise, \(R_{nm}=0\). Moreover, the rating vectors \(r_{u}=\{r_{u_{1}},r_{u_{2}},\cdots ,r_{u_{n}}\}\) form the rows of R, where \(r_{u_{1}}\) is the rating vector of user u1. Next, suppose the side information of users is D, where \(D_{u}=\{d_{u_{1}},d_{u_{2}},\cdots ,d_{u_{n}}\}\in D\) and \(d_{u_{1}}\) is the trust vector of user u1. The steps of generating soft targets with the SDAE are as follows (a code sketch follows this list):

  1) Each user \(u\in U\) is first corrupted to \(\widetilde{u}\) by adding noise. Then, for each hidden layer l = {1,2,3,⋯ ,L − 1} (where the optimal number of layers is 30, as shown in Section 4.4.5), we obtain

    $$ h_{l}=g({W_{l}^{T}}h_{l-1}+S_{l}\widetilde{D}_{\widetilde{u}}+b_{l}), $$
    (1)

    where g(x) is an activation function, here the sigmoid \(g(x)=\frac{1}{1+e^{-x}}\); \(W_{l}\in \mathbb{R}^{n\times k}\) and \(S_{l}\in \mathbb{R}^{n\times k}\) are the weight matrices of layer l; \(b_{l}\) is the bias vector of layer l; \(\widetilde{D}_{\widetilde{u}}\) denotes the noise-corrupted side information of user u; and \(h_{0}=\widetilde{r}_{\widetilde{u}}\) is the corrupted rating input.

  2) For the output layer L, the reconstruction is computed as

    $$ \left\{ \begin{aligned} &\hat{R}_{u}^{(1)}={g}(W_{L}h_{L}+b_{D_{u}})\\ &\hat{D}_{u}^{(1)}={g}(S_{L}h_{L}+b_{D_{u}}) \end{aligned} \right. $$
    (2)

    where \(W_{L}\in \mathbb{R}^{n\times k}\) and \(S_{L}\in \mathbb{R}^{n\times k}\) are the weight matrices of layer L, and \(b_{D_{u}}\) is the bias vector.

  3) The deep network, integrated with the social information, is used to reconstruct the input by minimizing the squared loss between the input and the reconstruction. The loss function is defined as

    $$ \begin{array}{@{}rcl@{}} loss(R,D,{\hat{R}_{u}^{(1)}},{\hat{D}_{u}^{(1)}})&=&\sum\limits_{{u}}\left[(R-{\hat{R}_{u}^{(1)}})^{2}+(D-{\hat{D}_{u}^{(1)}})^{2}\right]\\&&+\lambda(\|W_{l}\|_{F}^{2}+\|S_{l}\|_{F}^{2}+\|b_{D_{u}}\|_{F}^{2}) \end{array} $$
    (3)

    where λ denotes a hyper-parameter to avoid overfitting and \(\|\cdot \|_{F}^{2}\) is the squared Frobenius norm.
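As a rough PyTorch sketch of steps 1)-3) above (class and function names, layer sizes and the regularizer over all parameters are illustrative assumptions, not the authors' implementation; k = 80 and λ = 0.1 follow Sections 4.4.2 and 4.4.4):

```python
import torch
import torch.nn as nn

class SDAEPretrain(nn.Module):
    """Sketch of the Pre-training SDAE of Eqs. (1)-(3); sizes are assumptions."""
    def __init__(self, n_items, n_users, k=80, n_hidden=2):
        super().__init__()
        dims = [n_items] + [k] * n_hidden
        self.rating_layers = nn.ModuleList()   # hold W_l and b_l
        self.side_layers = nn.ModuleList()     # hold S_l
        for i in range(n_hidden):
            self.rating_layers.append(nn.Linear(dims[i], dims[i + 1]))
            self.side_layers.append(nn.Linear(n_users, dims[i + 1], bias=False))
        self.out_rating = nn.Linear(k, n_items)   # W_L branch of Eq. (2)
        self.out_side = nn.Linear(k, n_users)     # S_L branch of Eq. (2)

    def forward(self, r_tilde, d_tilde):
        h = r_tilde                                 # h_0: corrupted ratings
        for w, s in zip(self.rating_layers, self.side_layers):
            h = torch.sigmoid(w(h) + s(d_tilde))    # Eq. (1); Linear applies W^T
        r_hat = torch.sigmoid(self.out_rating(h))   # \hat{R}_u^{(1)}
        d_hat = torch.sigmoid(self.out_side(h))     # \hat{D}_u^{(1)}
        return r_hat, d_hat, h

def pretrain_loss(r, d, r_hat, d_hat, model, lam=0.1):
    """Squared reconstruction error plus Frobenius regularization, Eq. (3)."""
    rec = ((r - r_hat) ** 2).sum() + ((d - d_hat) ** 2).sum()
    reg = sum(p.pow(2).sum() for p in model.parameters())
    return rec + lam * reg
```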

3.3 Distilled knowledge layer

Knowledge distillation (KD) was proposed by Hinton et al. [13] to distill the knowledge learned by a complex teacher network into a simple student network. The “softmax” output layer normalizes each value zi produced by the neural network into the probability pi of correct classification, where the larger the probability, the more likely the sample belongs to that category. The probability of each category is expressed as

$$ p_{i}=\frac{\exp{(z_{i})}}{\sum\limits_{j}\exp{(z_{j})}}, $$
(4)

The core idea of knowledge distillation is to use soft targets to assist hard targets during training. First, the “softened” probability distribution of the teacher network is calculated; then it is used as part of the total loss to guide the training of the student network. The “softened” distribution is computed from the “softmax” by introducing a temperature parameter t:

$$ q_{i}=\frac{\exp{(\frac{z_{i}}{t})}}{\sum\limits_{j}\exp{(\frac{z_{j}}{t})}}, $$
(5)

where zi is the output logit before the softmax of the neural network and t denotes the temperature. When t = 1, this reduces to the common softmax output probability. A larger t relaxes (softens) the class probability distribution, while a smaller t enlarges the differences between probabilities, which easily incorporates unnecessary noise.

In recommendation systems, each user has multiple tags (such as Time, Age and Company), as shown in Fig. 4. Therefore, benefiting from knowledge distillation, the information from users’ tags (soft labels) can be used to assist training with hard labels. The output of the distillation layer is described as

$$ \left\{ \begin{aligned} Q_{u}=&\frac{\exp{(\frac{Z_{u}}{t})}}{\sum\limits_{u}\exp{(\frac{Z_{u}}{t})}}\\ Z_{u}=&W_{L}h_{L}+b_{D_{u}} \end{aligned} \right. $$
(6)

where Zu indicates the output logit before the “softmax” of the neural network. Herein, the distillation layer is developed to adjust each output unit of the pre-training network. A minimal code sketch of this layer follows.
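The sketch below implements the temperature-scaled softmax of Eqs. (5)-(6), assuming the logits Z_u are already computed by the pre-training network; t = 3 follows the optimal value found in Section 4.4.1:

```python
import torch

def distillation_layer(z_u, t=3.0):
    """Temperature-scaled softmax of Eqs. (5)-(6); z_u is the logit Z_u."""
    return torch.softmax(z_u / t, dim=-1)

# t = 1 recovers the ordinary softmax of Eq. (4); a larger t yields a
# softer (flatter) distribution.
z = torch.tensor([2.0, 1.0, 0.1])
print(distillation_layer(z, t=1.0))
print(distillation_layer(z, t=3.0))
```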

3.4 Re-training for learning implicit information from soft targets

After distillation, some implicit information in the soft targets has not yet been learned. Thus, the re-training network is used to learn it, with details as follows:

  1) The soft targets are first mapped to the low-dimensional hidden layer by

    $$ H_{u}=g({W_{1}^{T}}Q_{u}+b_{1}), $$
    (7)

    where \(W_{1}\in \mathbb{R}^{k\times m}\) and \(b_{1}\in \mathbb{R}^{k}\) are the training parameters that map the input vector Qu into the k-dimensional latent space.

  2) The implicit information is then learned from the low-dimensional hidden layer using knowledge distillation. The output of the distillation process for user u is expressed as

    $$ \left\{ \begin{aligned} &\hat{R}_{u}^{(2)}={g}({W_{2}^{T}}H_{u}+b_{2})\\ &\hat{D}_{u}^{(2)}={g}({W_{3}^{T}}Q_{u}+b_{3}) \end{aligned} \right. $$
    (8)

    where \(W_{2}\in \mathbb{R}^{k\times m}\), \(b_{2}\in \mathbb{R}^{k}\) and \(b_{3}\in \mathbb{R}^{k}\) denote the training parameters used to make predictions for each user.

  3) The output of re-training is obtained by using the deep network to integrate the hard targets and user nodes. The loss function of re-training is defined as

    $$ \begin{array}{@{}rcl@{}} loss(R,D,{\hat{R}_{u}^{(2)}},{\hat{D}_{u}^{(2)}})&=&\sum\limits_{{u}}\left[(R-{\hat{R}_{u}^{(2)}})^{2}+(D-{\hat{D}_{u}^{(2)}})^{2}\right]\\&&+\lambda(\|b_{1}\|_{F}^{2}+\|b_{2}\|_{F}^{2}+\|b_{3}\|_{F}^{2}) \end{array} $$
    (9)

Next, the overall loss function is obtained from the two parts, the pre-training and re-training networks. It is described as

$$ \begin{array}{@{}rcl@{}} &&loss(R,D,{\hat{R}_{u}^{(1)}},{\hat{D}_{u}^{(1)}},{\hat{R}_{u}^{(2)}},{\hat{D}_{u}^{(2)}})\\&=&\alpha\sum\limits_{{u}}loss\left( R,D,{\hat{R}_{u}^{(1)}},{\hat{D}_{u}^{(1)}}\right)+(1-\alpha)loss\left( R,D,{\hat{R}_{u}^{(2)}},{\hat{D}_{u}^{(2)}}\right)\\ &&+\lambda\left( \|W_{l}\|_{F}^{2}+\|S_{l}\|_{F}^{2}+\|b_{D_{u}}\|_{F}^{2}+\|b_{1}\|_{F}^{2}+\|b_{2}\|_{F}^{2}+\|b_{3}\|_{F}^{2}\right) \end{array} $$
(10)

where α represents the balance parameter used to adjust the proportions of the generation (pre-training) network and the re-training network, and λ is an adjustment parameter to prevent overfitting. A sketch of the re-training network and this joint loss follows.
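The following sketch outlines the re-training network of Eqs. (7)-(8) and the joint loss of Eq. (10); all names are illustrative assumptions, and α = 0.5, λ = 0.1 follow Sections 4.4.3-4.4.4:

```python
import torch
import torch.nn as nn

class Retrain(nn.Module):
    """Sketch of the Re-training network of Eqs. (7)-(8)."""
    def __init__(self, n_items, n_users, k=80):
        super().__init__()
        self.enc = nn.Linear(n_items, k)             # W_1, b_1 of Eq. (7)
        self.dec_rating = nn.Linear(k, n_items)      # W_2, b_2 of Eq. (8)
        self.dec_side = nn.Linear(n_items, n_users)  # W_3, b_3 of Eq. (8)

    def forward(self, q_u):                          # q_u: soft targets Q_u
        h_u = torch.sigmoid(self.enc(q_u))           # Eq. (7)
        r_hat2 = torch.sigmoid(self.dec_rating(h_u))  # \hat{R}_u^{(2)}
        d_hat2 = torch.sigmoid(self.dec_side(q_u))    # \hat{D}_u^{(2)}, from Q_u
        return r_hat2, d_hat2

def joint_loss(loss_pre, loss_re, reg, alpha=0.5, lam=0.1):
    """Eq. (10): alpha balances the two subnetworks; `reg` gathers the
    Frobenius terms of both pre-training and re-training parameters."""
    return alpha * loss_pre + (1 - alpha) * loss_re + lam * reg
```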

Finally, four prediction functions from the generation and distillation subnetworks are obtained for each user, i.e., \(\hat {R}_{u}^{(1)}\), \(\hat {R}_{u}^{(2)}\), \(\hat {D}_{u}^{(1)}\) and \(\hat {D}_{u}^{(2)}\). Specifically, these four results are trained to make predictions from different perspectives: the Pre-training network focuses on the known user-item pairs, while the Re-training one focuses more on the unknown user-item pairs. Therefore, they are combined to make the final recommendations for users:

$$ \hat{R}_{u}=\beta(\hat{R}_{u}^{(1)}+\hat{D}_{u}^{(1)})+(1-\beta)(\hat{R}_{u}^{(2)}+\hat{D}_{u}^{(2)}) $$
(11)

where β controls the contributions of Pre-training and Re-training to the final prediction. In this way, a more robust prediction function is achieved without additional training costs. A minimal sketch of this combination follows.
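A literal transcription of Eq. (11) as a helper function; the inputs are assumed to be commensurable score vectors, and β = 0.5 is an illustrative choice, not a value reported here:

```python
def final_prediction(r1, d1, r2, d2, beta=0.5):
    """Eq. (11): blend Pre-training (r1, d1) and Re-training (r2, d2) outputs."""
    return beta * (r1 + d1) + (1 - beta) * (r2 + d2)
```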

3.5 Model training

To obtain the optimized solution, the Stochastic Gradient Descent (SGD) approach [4] is used to train the EAF-SR model. At each iteration, the training parameters of the model are updated by their gradients. The update formula is defined as follows

$$ \theta_{e+1}=\theta_{e}-\eta g_{e}, $$
(12)

where θe is the value of the trained parameters θ of the EAF-SR model at iteration e, η denotes the learning rate of the training process, and ge indicates the gradient at iteration e. The details of this training process are summarized in Algorithm 1.

Algorithm 1: EAF-SR.

4 Experiments

In this section, a series of experiments are conducted to demonstrate the performance of the proposed EAF-SR, with the batch size, learning rate and number of training epochs set to 128, 0.0005 and 1000, respectively. These experiments are implemented in PyCharm and carried out on a workstation with three NVIDIA GeForce RTX 2080Ti GPUs and 10GB of memory.

4.1 Description of datasets

To validate the performance of the proposed approach, three real-world datasets related to social CF are adopted. They are taken from popular social networking websites, namely Flixster, Epinions and Douban, which permit users to express their interests by posting feedback and ratings for items. Furthermore, these datasets have different rating sparsity and social information; their statistics are presented in Table 2 and their details are as follows:

Table 2 The statistics of three real-world datasets

Flixster

is an American social-networking movie website for discovering new movies, learning about them, and meeting others with similar tastes. The platform lets users watch movie trailers and read about new and upcoming movies at the box office. Its dataset contains 1,049,511 users, 66,726 items, 8,196,077 ratings and 11,794,648 social relations. However, most users have not rated any items, so users and items with no ratings, which are meaningless for experimental evaluation, are removed. The final collected dataset includes 147,612 users, 48,794 items, 8,196,077 ratings and 2,538,746 social relations.

Epinions

is a well-known product review website established in 1999 [29]. Users can rate products from 1 to 5 and submit personal reviews. These ratings and reviews influence other consumers when they decide whether to purchase the same product. In addition, users can specify whom to trust, creating a social trust network in which the feedback and ratings of a user’s trustees are considered reliable and important. This dataset contains 49,289 users, 139,738 items, 664,823 ratings and 487,183 trust relations.

Douban

is a Chinese social networking service website that allows registered users to record information and create content related to film, books, music, recent events, and activities. Besides serving as a social network website like WeChat, it recommends potentially interesting books, movies, and music to registered users. Users can assign 5-scale integer ratings (from 1 to 5) to movies, books and music. This dataset contains 129,490 users, 58,541 items, 16,830,839 ratings and 1,692,952 friend relations.

In the experiments, these datasets are divided into two disjoint user sets: a training set and a test set. Among them, the test set is constructed by randomly selecting 1000 users with at least 500 ratings and 20 social relationships, and the remaining users and their ratings are kept in the training set.
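A small sketch of this split, assuming per-user counts are available as dictionaries (the function and variable names are illustrative):

```python
import numpy as np

def split_users(n_ratings, n_relations, n_test=1000, seed=0):
    """Test set: 1000 random users with >= 500 ratings and >= 20 social
    relations (Section 4.1); all remaining users form the training set."""
    rng = np.random.default_rng(seed)
    eligible = [u for u in n_ratings
                if n_ratings[u] >= 500 and n_relations.get(u, 0) >= 20]
    test_users = set(rng.choice(eligible, size=n_test, replace=False).tolist())
    train_users = set(n_ratings) - test_users
    return train_users, test_users
```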

4.2 Evaluation metrics

In recommendation systems, many previous studies use the Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE) to evaluate model performance. However, in real life many users only pay attention to a list of items of interest rather than to all items, which means that MAE and RMSE cannot accurately reflect the performance of a recommendation model. Hence, in this paper, the Mean Average Precision (MAP) and Normalized Discounted Cumulative Gain@N (NDCG@N) are adopted to evaluate the proposed method: the larger the values of MAP and NDCG@N, the higher the recommendation accuracy and quality. The metrics of MAP and NDCG@N are respectively defined as

$$ \left\{ \begin{aligned} &AP_{u}=\frac{1}{|I^{u}|}\sum\limits_{i\in I^{u}}{\frac{\sum\limits_{j\in I^{u}}\delta(rank_{uj}<rank_{ui})+1}{rank_{ui}}}\\ &MAP=\frac{\sum\limits_{u=1}^{|U^{te}|}{AP_{u}}}{|U^{te}|} \end{aligned} \right. $$
(13)

and

$$ \left\{ \begin{aligned} &DCG_{u}=\sum\limits_{i=1}^{N}{\frac{2^{{rel}_{i}}-1}{\log_{2}{(i+1)}}}\\ &IDCG_{u}=\sum\limits_{i=1}^{N}{\frac{1}{\log_{2}{(i+1)}}}\\ &NDCG_{u}@N=\frac{DCG_{u}@N}{IDCG_{u}}\\ &NDCG@N=\frac{\sum\limits_{u\in U^{te}}{NDCG_{u}@N}}{|U^{te}|} \end{aligned} \right. $$
(14)

where Iu denotes the set of correctly recommended items in the recommendation list of user u, rankui indicates the ranking position of item i in the recommendation list of user u, and Ute represents all users in the test set. Moreover, reli indicates the relevance of the recommendation result at position i: its value is 1 if the item at position i of the top-N list is adopted, and 0 otherwise.
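For reference, the two metrics can be computed per user as in the sketch below, written to follow Eqs. (13)-(14) as stated (including the IDCG that sums over the full top N); function names are illustrative:

```python
import numpy as np

def average_precision(ranked, relevant):
    """AP_u of Eq. (13): for each relevant item, (relevant items ranked
    above it + 1) / its rank, averaged over the relevant set I^u."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def ndcg_at_n(ranked, relevant, n=10):
    """NDCG_u@N of Eq. (14): rel_i = 1 if the item at position i is adopted."""
    dcg = sum((2 ** (1 if ranked[i] in relevant else 0) - 1) / np.log2(i + 2)
              for i in range(min(n, len(ranked))))
    idcg = sum(1 / np.log2(i + 2) for i in range(n))
    return dcg / idcg

# MAP and NDCG@N then average these per-user scores over the test set U^te.
```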

Fig. 4: The context-aware information of the rating matrix

4.3 Comparison

To evaluate the performance of the proposed method, the following methods are selected as competitors:

  • BPR [33]: This is a ranking-based recommendation algorithm that orders items according to users’ interest and recommends the highest-ranked items to users, where k = 128 and λ = 0.01.

  • PRFM [32]: This is a Ranking Factorization Machine (Ranking FM) model, which applies the Factorization Machine to microblog ranking on the basis of pairwise classification, where k = 3, λw = 10−6 and λv = 10−4.

  • trustMF [50]: This is a matrix factorization method based on either rating data or trust data, where λ = 0.001 and β1 = β2 = 0.5.

  • aSDAE [2]: This is a top-N recommendation model that exploits user side information through the reconstruction function of a stacked denoising autoencoder, where k = 50.

  • FunkR-pDAE [53]: This is a Funk singular value decomposition recommendation method using the Pearson correlation coefficient and deep auto-encoders, where α = 0.5, β = 0.5 and γ = 1.

  • DVMF [39]: This is a deep-learning-based, fully Bayesian recommendation framework that integrates users’ information, where φ = 0.5 and λ = 0.001.

4.3.1 Comparison results

Table 3 shows the results of experiments comparing the proposed EAF-SR with other recent methods. Compared with BPR, PRFM, trustMF, aSDAE, FunkR-pDAE and DVMF, EAF-SR achieves relative improvements in MAP@10 of 7.59%, 6.56%, 5.61%, 1.95%, 1.20% and 0.99% on Flixster; 5.50%, 3.89%, 1.61%, 1.18%, 0.79% and 0.54% on Epinions; and 8.42%, 7.60%, 4.61%, 3.21%, 2.76% and 1.94% on Douban. In terms of NDCG@10, the corresponding improvements are 5.86%, 5.44%, 3.17%, 2.63%, 2.17% and 1.99% on Flixster; 6.99%, 5.84%, 0.58%, 0.42%, 0.28% and 0.18% on Epinions; and 9.28%, 6.13%, 2.84%, 2.53%, 1.71% and 1.29% on Douban. Furthermore, it can also be observed that as N in top-N increases, the performance of all methods also increases (see MAP@1, MAP@5 and MAP@10).

Table 3 The comparison results of seven approaches in three different datasets

In short, since increases in MAP@10 and NDCG@10 represent improvements in recommendation accuracy and quality, the experimental results show that EAF-SR is more powerful than BPR, PRFM, trustMF, aSDAE, FunkR-pDAE and DVMF.

4.3.2 Comparisons of running time

To assess the computational efficiency of the model, experiments are conducted on a workstation with three NVIDIA GeForce RTX 2080Ti GPUs and 10GB of memory to compare the running time of the proposed EAF-SR with the other methods (BPR, PRFM, trustMF, aSDAE, FunkR-pDAE, and DVMF). For fair comparison, all experiments are run for 100 iterations. Although running the EAF-SR model on a CPU is less costly than on a GPU, training deep neural networks in parallel on a CPU is inefficient. Therefore, all experiments are performed in parallel on the GPU.

As shown in Table 4, with GPU acceleration, the running time of EAF-SR is less than that of BPR, PRFM, trustMF, aSDAE, FunkR-pDAE, and DVMF on all datasets, which shows that the proposed EAF-SR is superior in terms of computational efficiency.

Table 4 The comparisons of running time (s) of seven approaches with k = 80 in three different datasets

4.4 Parameters analysis

From the objective function (10), it can be seen that the values of five parameters need to be determined, i.e., t, k, α, λ, and L. Thus, this section discusses their effects on the performance of the proposed EAF-SR. To reduce the complexity of the experiments, MAP@10 and NDCG@10 are used to evaluate performance.

4.4.1 The impact of parameter t

In the proposed EAF-SR, the parameter t is a temperature that regulates the output values of the distillation layer. To study its effects, t is set to {1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5}, and experiments are conducted to test the influence of different t on the proposed EAF-SR. It can be observed from Fig. 5 that the values of MAP@10 and NDCG@10 rise quickly as the temperature t increases at the beginning. When t = 3, both values peak and the recommendation result is best. The results indicate that a temperature t that is too large or too small degrades the performance of knowledge distillation.

Fig. 5: The performance of the proposed EAF-SR with different t values, where the best result is achieved at t = 3

4.4.2 The impact of parameter k

To discuss the effect of the dimensionality k, experiments are carried out on the EAF-SR model with k in {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}, and the grid-search approach is used to find the best combination of values. Figure 6 illustrates that as the dimensionality k increases, the values of MAP@10 and NDCG@10 gradually increase at first. However, when k exceeds a threshold (about 80 for the Flixster, Epinions, and Douban datasets), the values of MAP@10 and NDCG@10 decrease. There are two explanations for this observation:

  1) A relatively larger dimension contributes to better performance.

  2) When the dimensionality exceeds a certain threshold, it can trigger over-fitting, which degrades the accuracy of the prediction.

Fig. 6: The performance of the proposed EAF-SR with different k values, where the best result is achieved at k = 80

4.4.3 The impact of parameter α

The parameter α balances the importance of the pre-training and re-training networks in Formula (10). Per Eq. (10), a larger value of α gives the pre-training (soft-target generation) network more weight, while an extremely small value makes the proposed EAF-SR model focus on the re-training network. To study its influence, α is set to {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}. Figure 7 shows the results of experiments with diverse values of α. It can be observed that as α increases, the values of MAP@10 and NDCG@10 increase up to a certain point (around 0.5 for the Flixster, Epinions, and Douban datasets). However, when α surpasses 0.5, the values of MAP@10 and NDCG@10 decrease. Therefore, the experimental results indicate that the recommendation result of the proposed EAF-SR is best when α = 0.5.

Fig. 7: The performance of the proposed EAF-SR with different α values, where the best result is achieved at α = 0.5

4.4.4 The impact of parameter λ

The parameter λ is a hyper-parameter to prevent over-fitting. To study its impact on the performance of the proposed EAF-SR, it is set to {0.00001, 0.0001, 0.001, 0.01, 0.1, 1}, and the grid-search approach is used to select its value. From Fig. 8, it can be seen that when λ = 0.1, the values of MAP@10 and NDCG@10 are the largest. In other words, the optimal value of λ is 0.1.

Fig. 8: The performance of the proposed EAF-SR with different λ values, where the best result is achieved at λ = 0.1

4.4.5 The impact of parameter L

The parameter L denotes the number of layers of the proposed EAF-SR. To discuss its effect, a series of experiments are conducted on the EAF-SR model with L in {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}. Figure 9 illustrates that as the number of layers L increases, the values of MAP@10 and NDCG@10 gradually increase at first. However, when L exceeds a threshold (about 30 for the Flixster, Epinions, and Douban datasets), the values of MAP@10 and NDCG@10 decrease; i.e., the optimal number of layers L is 30.

Fig. 9: The performance of the proposed EAF-SR with different layers L, where the best result is achieved at L = 30

In summary, the optimal values of the parameters t, k, α, λ and L are listed in Table 5.

Table 5 The best values of parameters t, k, α, λ and L in three datasets

4.5 Ablation studies on the Epinions dataset

For a better understanding of the proposed EAF-SR, the effects of Pre-training and Re-training are investigated on the Epinions dataset. As shown in Table 6, when only Pre-training is plugged into the network, MAP@10 and NDCG@10 are 0.7584 and 0.8359, respectively. Likewise, when only Re-training is added to the model, MAP@10 and NDCG@10 are 0.7003 and 0.8231. Finally, when both Pre-training and Re-training are plugged into the network, EAF-SR further improves MAP@10 and NDCG@10 to 0.9108 and 0.9425.

Table 6 Ablation results of the proposed EAF-SR

5 Conclusions

In this paper, an Enhanced Autoencoder Framework for Social Recommendation (EAF-SR) is proposed to address issues such as cold-start, data noise and data sparsity. Within the EAF-SR framework, the problem of learning hidden features from the soft targets produced by the pre-training network is also investigated. Specifically, a tightly coupled system is proposed that integrates the pre-training and re-training phases into a unified framework. In this way, the model can dynamically update the soft targets according to the training errors of the pre-training and re-training networks in real time, reducing the influence of noise while preserving knowledge. Furthermore, a new knowledge distillation layer is designed to regulate the outputs of the pre-training model for dynamic training based on the reliability of each output unit. Then, a new measurement approach is proposed to calculate the reliability of each output unit based on the number of corresponding positive markers. Finally, recommendations are made by combining the predictions of the pre-training and re-training networks.