
1 Introduction

The recommendation system plays an extremely important role in e-commerce platforms because it helps the platform promote advertisements and products to users and thus leads to greater commercial benefits [1]. Currently, collaborative filtering is the most widely used commercial recommendation algorithm. This algorithm completes a rating matrix based on the existing item-user ratings to predict the user ratings of unknown items [2, 3]. With the advent of the era of big data, the number of users and products has soared, and most products have been rated by only a small number of users. Thus, the sparsity of the rating matrix seriously affects the quality of the recommendation results [4].

Therefore, to improve accuracy, this paper proposes a new method that uses a Stack Denoising Auto-Encoder combined with Bayesian Personalized Ranking to determine a relevance ranking table for each item. Unlike previous methods, this approach does not rely on specific context information. For each item, we choose the auto-encoder approach in the feature extraction step: adding noise to the input data gives the model better generalization capability and greater robustness [5]. Additionally, by ranking the similarity probabilities between each item and all other items, similar items are guaranteed to be ranked higher than dissimilar ones; this ranking formulation has proved effective and alleviates the imbalance problem. To address the large computational cost, we propose a pre-training and fine-tuning strategy [6] for the model.

In this paper, the proposed model integrates the advantages of the BPR and the SDAE into a single deep learning model. Compared with traditional collaborative filtering recommendation algorithms, this model has several unique advantages. First, after deep feature extraction through the SDAE, the rating vector of each item is mapped to a richer representation of its hidden features. Meanwhile, the addition of noise improves the anti-interference property of the model, making the extracted features more reliable. Second, the final BPR ranking part can better capture the unique characteristics of each item and outputs the similarity probability between items, which effectively reduces the impact of data sparsity. Thus, this approach helps to improve the accuracy of recommendation. Third, to avoid poor parameter estimation, we design a pre-training and fine-tuning strategy based on the Bernoulli probability model. In the last part of this paper, experiments carried out on real commodity datasets are described. The results show that this model obtains more accurate recommendation results than classical collaborative filtering algorithms. Figure 1 shows an overview of our method.

Fig. 1. Overview

The remainder of this paper is organized as follows. In Sect. 2, we introduce the deep neural network model based on Stack Denoising Auto-Encoder and Bayesian Personalized Ranking. Section 3 presents the details and results of the experiments. Section 4 reviews the related works. Section 5 states the conclusions and describes future research directions.

2 Model Framework

This section first describes the pairwise ranking task in the commodity recommendation problem. Then, we propose a Bayesian deep neural network model based on the BPR and describe the pre-training strategy. Figure 2 shows the architecture of the model.

Fig. 2. SDAE-BPR

2.1 Preliminaries

Let u denote a user, \( u \in \{1, \ldots, N\} \), and i denote an item, \( i \in \{1, \ldots, I\} \). \( E \in \left\{ {0,1} \right\}^{I \times N} \) is the rating matrix formed by all of the users' scores on all of the items. Notation \( e_{iu} = E\left( {i,u} \right) = 1 \) indicates that user u is interested in item i, and \( e_{iu} = E\left( {i,u} \right) = 0 \) indicates that user u is not interested in item i. In most e-commerce platforms, the rating matrix E is usually sparse because most users only access a small part of the entire item database.

Define a similarity probability matrix \( R \in \left[ {0,1} \right]^{I \times I} \), with notation \( r_{ij} = R\left( {i,j} \right) \) representing the similarity probability of items i and j. Thus, for each item i, the other items can be divided into two disjoint sets: the set of items with known similar relations \( P_{i} = \{ j\left| {r_{ij} = 1} \right.\} \) and the set of items with uncertain similar relations \( M_{i} = \{ j\left| {r_{ij} < 1} \right.\} \). We formulate recommendation as a ranking task in which the learned model ensures that all items with a similar relationship are ranked ahead of the missing items. We can further divide the missing items \( M_{i} \) into unknown items \( U_{i} \) and dissimilar items \( N_{i} \), so that \( M_{i} = U_{i} \cup N_{i} \). In the training process of the ranking task, for \( j \in P_{i} \) and \( k \in M_{i} \), or for \( j \in U_{i} \) and \( k \in N_{i} \), the similarity probability \( r_{ij} \) should be greater than \( r_{ik} \). In Bayesian personalized ranking, this relation is called the partial relation \( j >_{i} k \).

Based on the above definition, the product recommendation task in this paper can be divided into two types of partial relations: \( \{ j >_{i} k \mid j \in P_{i}, k \in M_{i} \} \) and \( \{ j >_{i} k \mid j \in U_{i}, k \in N_{i} \} \). The set of all partial relations for item i is expressed as \( R_{i} = \{ \left( {j,k} \right) \mid j >_{i} k \} \). Therefore, the essence of the product recommendation task in this paper is a ranking task. Compared with classification and fitting, the ranking formulation better avoids the imbalance problem. The final goal of this ranking task is to maximize the likelihood probability of the ranking given by:

$$ \hbox{max} \prod\limits_{{{\text{i}} \in R^{I} }} {\prod\limits_{{(j,k) \in R_{i} }} {P(j > {}_{i}k)} } $$
(1)
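As an illustration only (not the paper's implementation), the following sketch shows one way to build the partial-relation pairs \( R_{i} \) used in Eq. (1) from a similarity matrix; the names `sample_pairs`, `sim`, and `num_pairs` are our own.

```python
# Illustrative sketch: sample partial-relation pairs (j, k) with j >_i k,
# where j is drawn from P_i (known similar) and k from M_i (uncertain).
import numpy as np

def sample_pairs(i, sim, num_pairs, rng):
    """For item i, sample pairs (j, k) so that j should be ranked before k."""
    P_i = np.setdiff1d(np.where(sim[i] == 1.0)[0], [i])  # known similar items (excluding i)
    M_i = np.where(sim[i] < 1.0)[0]                       # items with uncertain similarity
    if len(P_i) == 0 or len(M_i) == 0:
        return []
    j = rng.choice(P_i, size=num_pairs)
    k = rng.choice(M_i, size=num_pairs)
    return list(zip(j.tolist(), k.tolist()))

rng = np.random.default_rng(0)
sim = np.array([[1.0, 1.0, 0.2, 0.0],                     # toy similarity matrix R
                [1.0, 1.0, 0.0, 0.3],
                [0.2, 0.0, 1.0, 1.0],
                [0.0, 0.3, 1.0, 1.0]])
pairs_for_item_0 = sample_pairs(0, sim, num_pairs=2, rng=rng)
```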

2.2 Feature Extraction

The SDAE is a deep structure model that connects multiple Denoising Auto-Encoders. For a plain Auto-Encoder, it is easy to obtain a trivial identity function if the features of the rating vectors are extracted only by minimizing the error between the input and output. In the SDAE, the Auto-Encoders are stacked rather than used individually, and noise is added to the user ratings of each item, so the encoder is forced to remove the noise when reconstructing the input data. After training, because of this denoising process, the feature extraction layers learn a function that is more complex than the identity function. To further reduce the influence of noise in the ratings, we add an L2 regularization term [7] to the loss function, which penalizes excessively large weights. In addition, given the massive amount of data available on the Internet, a shallow model has limited capacity to represent the numerous rating vectors and cannot accurately distinguish the characteristics of different items. By contrast, a deep model can capture the hidden characteristics behind each item's ratings because of its more powerful extraction ability, producing more expressive and representative features than a shallow model. The design of the SDAE is described in Table 1.

Table 1. Algorithm of SDAE for feature extraction

In Table 1, H = 3 is the number of layers, and the structure of the SDAE is U-A-B. Here, U is the input layer of each rating vector, and A and B are the first and second hidden layers, respectively. For the original rating vector x, a fixed proportion of its entries is set to 0 before it is input into the network, yielding the corrupted input \( x^{ \wedge } \). Then, \( x^{ \wedge } \) is used as the input, and a greedy strategy is applied to train the network layer by layer. All of the above steps constitute the pre-training of the SDAE layers. Finally, we use the output of the last layer as the hidden features. Considering that the input data are non-negative and the mapping should be nonlinear, the sigmoid function is a suitable choice for the activation function. Here, W is the weight matrix, and b is the bias vector. The training process is as follows: first, \( x^{ \wedge } \) is fed into the first hidden layer to obtain \( f^{1} \). Then, \( f^{1} \) is passed through the decoding function to obtain \( x^{{{\prime }\left( 1 \right)}} \). For each item's rating vector, the reconstruction error function in any layer is given by Eq. (2).

$$ l(x,x^{\prime}) = - \sum\limits_{n = 1}^{N} \left[ x_{n} \log (x_{n}^{\prime}) + (1 - x_{n})\log (1 - x_{n}^{\prime}) \right] $$
(2)

For the whole training set with I items, the error function of the integration in this layer is given by Eq. (3).

$$ J(\theta) = - \frac{1}{I}\sum\limits_{i = 1}^{I} \sum\limits_{n = 1}^{N} \left[ x_{n}^{(i)} \log (x_{n}^{\prime (i)}) + (1 - x_{n}^{(i)})\log (1 - x_{n}^{\prime (i)}) \right] + \frac{\lambda}{2N}\sum\limits_{h = 1}^{H - 1} \sum\limits_{i = 1}^{I_{h}} \sum\limits_{n = 1}^{I_{h + 1}} (\theta_{ni}^{(h)})^{2} $$
(3)

The goal of training is to minimize the error function in each iteration. To prevent over-fitting, the regularization parameter \( \lambda \) is introduced to keep the weights from becoming too large. For the error function of each layer, Back-Propagation and Stochastic Gradient Descent are combined to obtain the parameters \( \theta^{l} \) of that layer. Under these parameters, the output \( f^{l} \) of this layer becomes the input of the next hidden layer. This training process is repeated layer by layer, and the parameters of each trained layer are kept.
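To make the procedure of Table 1 easier to follow, the sketch below shows a greedy layer-wise pre-training loop with masking noise, sigmoid activations, the cross-entropy loss of Eqs. (2)-(3), and L2 weight decay. It is a minimal PyTorch illustration under assumed layer sizes and hyper-parameters, not the authors' code.

```python
# Minimal sketch of greedy layer-wise SDAE pre-training (structure U-A-B).
import torch
import torch.nn as nn
import torch.nn.functional as F

def corrupt(x, noise_ratio=0.3):
    """Set a random fraction of the rating entries to 0 (masking noise)."""
    mask = (torch.rand_like(x) > noise_ratio).float()
    return x * mask

def pretrain_layer(data, in_dim, hidden_dim, epochs=10, lr=0.1, weight_decay=1e-4):
    """Train one denoising auto-encoder layer and return (encoder, hidden codes)."""
    encoder = nn.Linear(in_dim, hidden_dim)
    decoder = nn.Linear(hidden_dim, in_dim)
    opt = torch.optim.SGD(list(encoder.parameters()) + list(decoder.parameters()),
                          lr=lr, weight_decay=weight_decay)   # L2 term as in Eq. (3)
    for _ in range(epochs):
        x_hat = corrupt(data)                        # corrupted input x^
        f = torch.sigmoid(encoder(x_hat))            # hidden code f^l
        x_rec = torch.sigmoid(decoder(f))            # reconstruction x'
        loss = F.binary_cross_entropy(x_rec, data)   # Eq. (2), averaged over items
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        codes = torch.sigmoid(encoder(data))
    return encoder, codes

ratings = torch.rand(100, 50).round()                # toy 0/1 rating vectors for 100 items
enc1, f1 = pretrain_layer(ratings, 50, 32)           # layer U -> A
enc2, f2 = pretrain_layer(f1, 32, 16)                # layer A -> B; f2 plays the role of F_i
```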

2.3 Learning to Rank

The rest of the deep model proposed in this paper is built on top of the hidden features produced by the SDAE. In the final output layer, the BPR is adopted to learn a ranking so that the most similar items are ranked at the top, and the final recommended items are selected from the top k items. The training of the model maximizes the likelihood probability specified by Eq. (1), and the loss function of the whole model is given by Eq. (4).

$$ L(\theta_{c}, \theta_{r}) = - \sum\limits_{i} \sum\limits_{(j,k) \in R_{i}} P(j > {}_{i}k) + \lambda_{1} \left\| \theta_{c} \right\|^{2} + \lambda_{2} \left\| \theta_{r} \right\|^{2} $$
(4)

Here, \( \theta_{c} = \left\{ {W^{1}_{1} ,W^{2}_{1} ,b^{1}_{1} ,b^{2}_{1} } \right\} \) denotes the weights and biases of the SDAE part, and \( \theta_{r} = \left\{ {W_{2} ,b_{2} ,b_{3} } \right\} \) denotes the parameters of the remaining parts. After the pre-training of the SDAE layers, \( \theta_{c} \) and \( \theta_{r} \) are refined by Back-Propagation with Stochastic Gradient Descent. When the gradient becomes stable, the probability that each of the other items is similar to a given item is calculated, from which a ranking list can easily be created. The system then recommends items to users according to this list. After training, the user rating vectors of any two items are used as input, and the similarity probability \( r_{ij} = P(e_{ij} = 1) \) between the two items is obtained; this probability value is the basis on which we recommend items.
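As a small illustration of how the ranking list is produced, the sketch below ranks all other items for a given item by their predicted similarity probability and returns the top k; `predict_similarity` is a hypothetical stand-in for the trained predictor of Eq. (6).

```python
# Illustrative sketch: derive a top-k recommendation list for item i from
# the predicted pairwise similarity probabilities r_ij.
import numpy as np

def recommend_for_item(i, num_items, predict_similarity, k=10):
    """Rank all other items for item i by the predicted probability r_ij."""
    probs = np.array([predict_similarity(i, j) if j != i else -np.inf
                      for j in range(num_items)])
    return np.argsort(-probs)[:k]          # indices of the k most similar items

# toy usage with a made-up similarity function
top5 = recommend_for_item(0, num_items=100,
                          predict_similarity=lambda i, j: 1.0 / (1 + abs(i - j)),
                          k=5)
```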

Hidden Layer. The inputs are \( F_{i} \), \( F_{j} \), \( F_{k} \), the features extracted from items i, j, k by the SDAE. The purpose of the hidden layer is to embed them into \( H_{i} \), \( H_{j} \), \( H_{k} \) for further calculation. In this layer of the network, we select the ReLU function as the activation function. Because this hidden layer is not intended for feature extraction, we choose the ReLU function, which retains relatively less information but converges faster, as shown in Eq. (5).

$$ H_{i} = ReLU(W_{2} F_{i} + b_{2} ) $$
(5)

Predicting Layer. The inputs are \( H_{i} \), \( H_{j} \), \( H_{k} \), the outputs are \( r_{ij} \) and \( r_{ik} \), and the activation function is given by Eq. (6).

$$ r_{ij} = \sigma (b_{3}^{i} + b_{3}^{j} + H_{j} H_{i}^{T} ) $$
(6)

Sigmoid is chosen as the activation function because the final output probability should lie within [0, 1]. In the preceding hidden layer, to improve training efficiency, all of the items share the same parameters \( \left\{ {W_{2} ,b_{2} } \right\} \), but each item has its own parameter \( b^{i}_{3} \) in the predicting layer. Therefore, the model is better able to capture the inherent characteristics of each item and to improve the accuracy of the recommendation. The probability of the partial relation between j and k is defined by Eq. (7).

$$ P(j > {}_{i}k) = \frac{{r_{ij} - r_{ik} }}{2} + 0.5 $$
(7)
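The following minimal numpy sketch puts Eqs. (5)-(7) together for a single triple of items; all parameter names, shapes, and values are illustrative assumptions rather than the paper's settings.

```python
# Illustrative sketch of the ranking part: shared hidden layer {W2, b2},
# item-specific biases b3, and the pairwise probability P(j >_i k).
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden(F, W2, b2):
    return relu(W2 @ F + b2)                       # Eq. (5)

def similarity(H_i, H_j, b3_i, b3_j):
    return sigmoid(b3_i + b3_j + H_j @ H_i)        # Eq. (6)

def partial_relation_prob(r_ij, r_ik):
    return (r_ij - r_ik) / 2.0 + 0.5               # Eq. (7)

# toy example: 16-dimensional SDAE features, an 8-unit hidden layer
rng = np.random.default_rng(0)
W2, b2 = rng.normal(size=(8, 16)), np.zeros(8)
F_i, F_j, F_k = rng.random(16), rng.random(16), rng.random(16)
b3 = rng.normal(size=3)                            # biases b3 for items i, j, k
H_i, H_j, H_k = hidden(F_i, W2, b2), hidden(F_j, W2, b2), hidden(F_k, W2, b2)
r_ij = similarity(H_i, H_j, b3[0], b3[1])
r_ik = similarity(H_i, H_k, b3[0], b3[2])
p = partial_relation_prob(r_ij, r_ik)              # P(j >_i k) in [0, 1]
```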

2.4 Pre-training of \( \varvec{\theta}_{\varvec{r}} \) and Fine-Tuning

In the feature extraction section above, the pre-training method for \( \theta_{c} \) was described. Here, the pre-training method for \( \theta_{r} \) and the fine-tuning method for the whole model are introduced. Table 2 describes the flow of these two algorithms; the steps are explained in detail below.

Table 2. Algorithm of pre-training \( \theta_{r} \) and fine-tuning.

Pre-training \( \theta_{r} \). The features \( F_{i} \) generated by the SDAE part are used as the input, and the outputs are \( r_{ij} \) and \( r_{ik} \). To estimate the parameter set \( \theta_{r} = \left\{ {W_{2} ,b_{2} ,b_{3} } \right\} \), the remaining structural parts of the model must be pre-trained. The similarity relation \( e_{ij} \) between items in the training set is regarded as a sample of the Bernoulli distribution with parameter \( r_{ij} \) given by Eq. (8).

$$ p(e_{ij} |r_{ij} ) = r^{{e_{ij} }}_{ij} (1 - r_{ij} )^{{1 - e_{ij} }} $$
(8)

The likelihood probability corresponding to the above equation is defined as follows:

$$ L = \sum\limits_{i,j} {(e_{ij} \log r_{ij} + (1 - e_{ij} )\log (1 - r_{ij} ))} $$
(9)

To estimate \( \theta_{r} \), we define a function \( r_{ij} = g(F_{i}, F_{j}) \) that maps \( F_{i} \), \( F_{j} \) to \( r_{ij} \). The positive examples are sampled from the set \( P_{i} \) and the negative examples from the set \( N_{i} \). The negative examples are collected from \( N_{i} \) rather than from \( U_{i} \) because this approach greatly improves the training efficiency. Therefore, the log-likelihood of \( \theta_{r} \) is defined by Eq. (10).

$$ L(\theta_{r}) = \sum\limits_{i} \left( \sum\limits_{j \in P_{i}} \log g(F_{i}, F_{j}) + \sum\limits_{j \in N_{i}} \log [1 - g(F_{i}, F_{j})] \right) - \lambda ||\theta_{r}||^{2} $$
(10)

For the above equation, we use the Stochastic Gradient Descent optimization. In each iteration of the SGD, the updating method of \( \theta_{r} \) is given by Eq. (11), where \( \eta \) is the learning rate and \( \lambda \) is the regularization parameter.

$$ \Delta \theta_{r} = \eta \cdot \left( [e_{ij} - g(F_{i}, F_{j})] \cdot \frac{\partial g}{\partial \theta_{r}} - \lambda \theta_{r} \right) $$
(11)
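A minimal sketch of this update, assuming \( \theta_{r} \) is flattened into a single parameter vector and that the gradient \( \partial g / \partial \theta_{r} \) is supplied by a helper (here a toy `grad_g`), is given below; it is an illustration of Eq. (11), not the authors' implementation.

```python
# Illustrative SGD step of Eq. (11) for the pre-training of theta_r.
import numpy as np

def sgd_step(theta_r, F_i, F_j, e_ij, g, grad_g, eta=0.01, lam=1e-4):
    """One update of Eq. (11); theta_r is treated as a flat parameter vector."""
    prediction = g(theta_r, F_i, F_j)            # r_ij = g(F_i, F_j)
    gradient = grad_g(theta_r, F_i, F_j)         # dg / d(theta_r), same shape as theta_r
    delta = eta * ((e_ij - prediction) * gradient - lam * theta_r)
    return theta_r + delta                       # gradient ascent on the log-likelihood

# toy usage with a made-up logistic score over 2+2 feature dimensions
theta = np.zeros(4)
g = lambda t, a, b: 1.0 / (1.0 + np.exp(-(t[:2] @ a + t[2:] @ b)))
grad = lambda t, a, b: g(t, a, b) * (1 - g(t, a, b)) * np.concatenate([a, b])
theta = sgd_step(theta, np.array([1.0, 0.0]), np.array([0.5, 0.5]),
                 e_ij=1, g=g, grad_g=grad)
```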

Fine-Tuning. Because the parameters are pre-trained separately, fine-tuning of the whole model is necessary. The pre-training gives the parameters a better initial point in the search space. For the fine-tuning, we adopt the AdaDelta algorithm [8], which scales the SGD learning rate based on the history of gradients and weights and accelerates the convergence of the neural network in the first stage of training.
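A hedged sketch of this fine-tuning stage is shown below; `model` and `pairwise_loss` are placeholders for the full SDAE-BPR network and the loss of Eq. (4), and only the use of AdaDelta over the joint parameters is the point.

```python
# Illustrative fine-tuning loop with AdaDelta over the whole model.
import torch

def fine_tune(model, pairwise_loss, train_batches, epochs=5):
    """Jointly fine-tune all parameters of the SDAE-BPR model."""
    opt = torch.optim.Adadelta(model.parameters())   # step size scaled from gradient/update history
    for _ in range(epochs):
        for F_i, F_j, F_k in train_batches:          # j in P_i, k in N_i
            loss = pairwise_loss(model, F_i, F_j, F_k)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```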

3 Experiments and Analysis

3.1 Data Sets and Evaluation Indicators

In the experimental part of this work, the proposed model is compared with existing classical collaborative filtering recommendation algorithms. The experiments are based on the movielens 20M dataset and the movielens 1M dataset, respectively. For each method, 10-fold cross-validation is performed on the dataset, and the average result is reported.

To measure the performance of the different methods on the specified datasets, we use the AUC as an evaluation indicator and draw Precision-Recall graphs for intuitive comparison.

Movielens 20M Dataset [9]. This dataset is a stable benchmark dataset. A total of 138,000 users applied 20 million ratings and 465,000 attribute tags to 27,000 movies; the dataset also provides tag genome data with 12 million movie relevance scores.

Movielens 1M Dataset. This dataset is a small dataset with 600 users applying 100,000 ratings and 3,600 attribute tags to 9,000 movies.

UB – CF [10]. User-based collaborative filtering measures the similarity between users based on their ratings of items, collects a set of similar users, and then recommends a sorted list of the items those users favour. However, because of the large number of items on the Internet, most users rate only a few items, so UB-CF suffers from a sparsity problem that is difficult to solve.

IB – CF [11]. Item-based collaborative filtering uses the interaction information between users and items to make recommendations. Currently, IB-CF is the most widely used recommendation algorithm. However, the shallow model it adopts cannot learn the deep features of users and items.

AUC [12]. Formally, the AUC measures the ranking quality of the sample predictions, while the ROC curve depicts the TPR against the FPR as the classification threshold changes. Therefore, the area under the ROC curve (AUC) is used here as the measurement index instead of the ROC curve itself. AUC values range from 0.5 to 1, with higher values indicating better performance. To evaluate the recommendation performance, we use the AUC between P and U to measure the model's ability to rank.
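As a concrete illustration of this evaluation protocol (our example, with toy labels and scores), the AUC for one item can be computed by labelling items in \( P_{i} \) as 1 and sampled items in \( U_{i} \) as 0 and scoring them with the predicted \( r_{ij} \):

```python
# Toy AUC computation for one item i: P_i items labelled 1, U_i items labelled 0,
# scored by the predicted similarity probabilities r_ij.
from sklearn.metrics import roc_auc_score

labels = [1, 1, 0, 0, 0]               # j in P_i -> 1, j in U_i -> 0 (toy data)
scores = [0.9, 0.7, 0.4, 0.8, 0.2]     # predicted r_ij for the same items
auc = roc_auc_score(labels, scores)    # area under the ROC curve
```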

Precision-Recall Curve [13]. We can rank the samples according to the prediction result of the learner: the samples ranked first are the examples that the learner considers "most likely" to be positive, and the samples ranked last are the examples that the learner considers "least likely" to be positive. By treating the samples as positive predictions one by one in this order, the current precision and recall can be calculated at each step. Equations (12) and (13) give the definitions, where TP, FP, and FN represent the numbers of true positives, false positives, and false negatives, respectively. The Precision-Recall curve is obtained by plotting precision on the vertical axis against recall on the horizontal axis. If the Precision-Recall curve of one learner is completely enclosed by the curve of another learner, the latter can be asserted to have better performance than the former.

$$ P = \frac{TP}{TP + FP} $$
(12)
$$ R = \frac{TP}{TP + FN} $$
(13)
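A minimal sketch of how one Precision-Recall curve is traced from a ranked list, following Eqs. (12) and (13), is given below; the input labels are toy values.

```python
# Illustrative P-R curve: sweep the ranked list, treating the top-m items as
# positive predictions, and record one (recall, precision) point per m.
def precision_recall_points(ranked_labels):
    """ranked_labels: 1/0 ground-truth labels sorted by predicted score (descending)."""
    total_pos = sum(ranked_labels)
    if total_pos == 0:
        return []
    points, tp = [], 0
    for m, label in enumerate(ranked_labels, start=1):
        tp += label
        precision = tp / m                 # Eq. (12): TP / (TP + FP)
        recall = tp / total_pos            # Eq. (13): TP / (TP + FN)
        points.append((recall, precision))
    return points

curve = precision_recall_points([1, 0, 1, 1, 0, 0])
```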

To evaluate the ranking quality of the recommendation results, we also adopt NDCG@n as one of the evaluation indicators.

NDCG [14]. Normalized Discounted Cumulative Gain (NDCG) is the ratio of the DCG to the Ideal DCG, which is obtained when the similar items \( P_{i} \) are always ranked before the rest of the items. A higher NDCG value indicates better ranking performance. Commonly, NDCG@n, which computes the NDCG over the top n ranked items, is used in recommendation tasks. The NDCG is defined as follows:

$$ N(n) = Z_{n} \sum\limits_{j = 1}^{n} {(2^{r(j)} - 1)/\log (1 + j)} $$
(14)

In Eq. (14), j represents the position in the ranked list, r(j) is the relevance of the item at position j, and \( Z_{n} \) is the normalization constant. In our experiments, we calculate NDCG@5 for each item and average the results as a metric. Meanwhile, we also evaluate different values of n for a more detailed estimation.
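For completeness, a small sketch of computing NDCG@n as in Eq. (14) is shown below (our illustration; the logarithm base cancels in the DCG/IDCG ratio, so log2 is used here).

```python
# Illustrative NDCG@n for one item's ranked list; rel[j-1] is the relevance r(j)
# of the item at rank j (1 if similar, else 0).
import numpy as np

def dcg_at_n(rel, n):
    rel = np.asarray(rel, dtype=float)[:n]
    return np.sum((2.0 ** rel - 1.0) / np.log2(np.arange(2, len(rel) + 2)))

def ndcg_at_n(rel, n):
    ideal = dcg_at_n(sorted(rel, reverse=True), n)   # Ideal DCG: similar items ranked first
    return dcg_at_n(rel, n) / ideal if ideal > 0 else 0.0

score = ndcg_at_n([1, 0, 1, 0, 0, 1], n=6)           # e.g. NDCG@6 as reported in Table 3
```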

3.2 Result Analysis

To measure the performance of these models on datasets of different sizes, the models are validated on the movielens 20M dataset and the movielens 1M dataset, respectively. The validation is performed by calculating the AUC of the results and drawing P-R curves to compare the comprehensive performance of the models. It was found that both the AUC and the P-R curve of the SDAE-BPR are higher than those of the classical algorithms. At the same time, the NDCG@n of the SDAE-BPR is also larger than those of the other two classical collaborative filtering algorithms.

AUC. As observed from the data presented in Table 3, the AUC of the SDAE-BPR is 2.6% higher than that of the IB-CF and 4.1% higher than that of the UB-CF for the movielens 20M dataset. For the movielens 1M dataset, the AUC of the SDAE-BPR is 1.7% higher than that of the IB-CF and 3.1% higher than that of the UB-CF. From the horizontal comparison, as the size of the training set increases, the models fit better; therefore, all three methods perform better on the large dataset than on the small one. From the longitudinal perspective, the SDAE-BPR performs best on each training set, followed by the IB-CF and then the UB-CF. The UB-CF shows the worst performance because, when the number of items is very large, each user can only evaluate a few items, making it difficult to find enough sufficiently similar users in the training set. Because it measures the similarity between items, the IB-CF can find many similar items in the training set. In addition to calculating the similarity of the items, the SDAE-BPR also deeply extracts the user rating vectors to obtain the unique characteristics of each item, which ensures the higher accuracy of the recommendation results. Furthermore, the differences among the methods in NDCG@6 are larger than those in AUC, so we can assume that the main strength of the SDAE-BPR lies in producing a better ranking.

Table 3. Comparison of AUC and NDCG@6 results.

Precision-Recall Curve. In Fig. 3, the P-R curve of the SDAE-BPR essentially encloses the curves of the other two models, and the same is observed in Fig. 4. According to the definition of the P-R curve given above, we can conclude that the comprehensive performance of the SDAE-BPR is the best on both the large and the small dataset. These conclusions are in accordance with the AUC results. By comparing the two results, we can clearly determine that the SDAE-BPR has better comprehensive performance than the classical collaborative filtering algorithms. However, in Fig. 4, the P-R curves of all of the methods nearly overlap, and even with the larger training set in Fig. 3, the curves do not become markedly different. Based on this observation, we infer that a large improvement in precision and recall is not the main advantage of the SDAE-BPR. To verify this assumption, the NDCG analysis in the next section is required.

Fig. 3. Precision-Recall curves on movielens 20M

Fig. 4. Precision-Recall curves on movielens 1M

NDCG@n. First, we need to choose a value of n that makes the NDCG as large as possible while keeping the training cost as small as possible. It is observed from Fig. 5 that the NDCG becomes stable when n equals 6, and the same is observed in Fig. 6. Therefore, we choose 6 as the value of n. As shown in Table 3, on each dataset, the NDCG@6 of the SDAE-BPR is much larger than those of the other two classical collaborative filtering algorithms. The curve of the SDAE-BPR in Fig. 5 is much higher than the other two curves, and in Fig. 6 the differences remain. This result means that the SDAE-BPR consistently produces more accurate rankings than the classical methods; therefore, the SDAE-BPR has the highest ranking quality among the three algorithms. In addition, the difference in NDCG@6 among the methods is much larger than that in AUC. This is because the goal of the SDAE-BPR during training is to maximize the difference between the positive example probability and the negative example probability, rather than simply to fit the similarity of each item. These findings suggest that the SDAE-BPR model is particularly suitable for application scenarios that require a small number of precise recommendations. In addition, for each item i, the item-specific bias \( b_{i} \) also contributes to improving the quality and accuracy of the recommendation ranking results.

Fig. 5. NDCG@n on movielens 20M

Fig. 6. NDCG@n on movielens 1M

4 Related Work

The most popular model for recommender systems is k-nearest neighbour (kNN) collaborative filtering [15]. Recently, matrix factorization (MF) has become very popular in recommender systems for both implicit and explicit feedback. In early work [16], singular value decomposition (SVD) was proposed to learn the feature matrices. MF models learned by SVD have been shown to be highly prone to overfitting. Below, we review the methods most closely related to this paper.

Deep Learning. Deep learning has become highly popular in the era of Internet big data and artificial intelligence [17]. By combining low-level features to form denser high-level semantic abstractions, deep learning can automatically discover distributed feature representations of data. It avoids the manual feature design of traditional machine learning and has achieved breakthroughs in image recognition, machine translation, speech recognition, online advertising and other fields. In image recognition, the accuracy of deep learning exceeded 97% in the 2016 ImageNet image classification competition. In machine translation, the Google neural machine translation system (GNMT) based on deep learning has achieved a translation level close to that of humans for English to Spanish and English to French [18]. In online advertising, deep learning is widely used to predict the click-through rate of advertisements and has been applied successfully by Google [19], Microsoft [20], Huawei [21], Alibaba [22] and other enterprises. Deep learning covers a wide range of machine learning techniques and structures; the SDAE used in this paper is one such deep learning structure.

SDAE. The Auto-Encoder is a common neural network method for feature extraction [23, 24]. These networks are trained to reconstruct their inputs through dimensionality reduction, resulting in better representations than the original data. Other common neural network feature extractors include the Convolutional Neural Network (CNN) [25] and the Recurrent Neural Network (RNN) [26]. However, the CNN is a kind of multi-layer perceptron that is mainly used to process two-dimensional or three-dimensional image data. The nodes between the hidden layers of the RNN are connected and able to memorize past information, so the RNN is more suitable for sequence modelling. Considering that the rating data in the recommendation system are one-dimensional values without sequential relations, methods such as the CNN and RNN cannot play a meaningful role in feature extraction and only increase the computational complexity. Therefore, the stack denoising auto-encoder may be a better choice [27]. The advantage of the stack structure [28, 29] is that multi-level abstract representations and deeper implicit representations of the data can be obtained by stacking multiple auto-encoders.

BPR. From an algorithmic point of view, existing approaches to the recommendation problem can be roughly divided into three categories: classification, fitting, and ranking [30, 31]. The classification approach regards recommendation as a binary classification problem, using predefined features to train a classifier and then using the classifier to predict the similarities between items. The fitting approach converts the item scores into a real-valued rating matrix and uses a collaborative filtering method such as matrix decomposition to predict the similarity probability of unscored items. The problem of data sparsity makes classification and fitting methods biased towards dissimilarity [32]. The ranking approach treats recommendation as a learning-to-rank task: for each item, by ranking the similarity probabilities of the other items, it guarantees that similar items are ranked higher than dissimilar items; this formulation has proved effective and can solve the imbalance problem. Among these models, the BPR model [33, 34] defines a Bayesian pairwise ranking relationship between items, namely that the probabilities of similar items should be greater than those of non-similar items. This model has been verified to achieve good performance.

5 Conclusions

In this paper, the existing BPR model and the SDAE are combined to recommend products. The SDAE is used to extract the implicit characteristics of the user rating vectors, and the BPR part ranks the items based on these deep hidden features. On the basis of the entire model, a pre-training scheme suitable for initializing the model is proposed, and an optimization strategy is then used to speed up training and improve recommendation accuracy. As demonstrated by the experimental verification, the model proposed in this paper performs better than the existing classical collaborative filtering commodity recommendation algorithms, obtaining higher accuracy and better ranking of the results and avoiding the impact of sparsity. However, this model still has some shortcomings. For example, when the data volume is large, features must be extracted for each item and the similarity probability of every pair of items must be calculated, so model learning and data storage may encounter difficulties. Nevertheless, the application of deep learning in recommendation systems has been studied by many researchers and has proven to be feasible, so we believe that this research direction is promising.

In the future, we will study the interpretability of the algorithm. Currently, we can only provide a probability to the user; it is not yet possible to convincingly show why the probability takes this value. Furthermore, users' interests in different goods on e-commerce platforms change rapidly over time, so it will be useful to extend our method to model these changes in user interest.