1 Introduction

In the environment of Big Data and E-commerce, machine learning has widened its scope of application. Table 1 shows the relationship between E-commerce, Big Data Analytics (BDA) and Neural Networks (NN) as three intermixed and interdependent entities, with the goal of creating a profitable domain for the E-commerce industry (Figs. 1, 2 and 3).

Table 1 Big data analytics in E-commerce

As the pool of online content widens and deepens, users are presented with a plethora of choices, which often poses the 'paradox of choice'. Recommendation systems are expert systems that harness the power of data to enhance the user's decision-making process.

Recommendation systems gather information in three ways: implicitly, explicitly [1], and by hybrid methods. In implicit feedback, user behaviour is captured from the actions the user performs while accessing the application, without requiring user intervention. In explicit feedback, information is gathered through explicit user intervention. Hybridization of feedback is achieved in two ways: (i) the user is allowed to participate in explicit feedback gathering, and (ii) implicit data is used as a check on explicit feedback.

Recommendation systems can be broadly classified into three categories. Content-based systems [2] recommend items based on features extracted from the items that the user has rated positively.

Collaborative filtering based systems [3, 4] recommend items to the target user by mapping similar users, and can be further categorized into item-based and user-based:

  • Item-based: finds correlations between items based on users' previous ratings.

  • User-based: finds correlations between the target user's profile and other users' profiles.

The third major category of recommendation systems is Hybrid [5], which combines content-based and collaborative filtering (CF).

Major issues in recommender systems are cold start, shilling attacks, synonymy, grey sheep, limited content analysis and overspecialization, privacy, latency, evaluation and availability of online datasets, context awareness, sparsity and scalability [6, 7]. However, due to the inherent nature of CF, cold start, sparsity, and shilling attacks are the most prominent problems. The 1990s saw information overload [3] as a major challenge in context-based recommender systems. Amazon was one of the earliest commercial applications of CF, and over the years CF algorithms have advanced in complexity and performance. The algorithms under CF are grouped as:

  1. Memory-based [8, 9]: these employ statistical techniques on the user-item matrix to find the nearest neighbours of the users and make recommendations.

  2. Model-based [10]: huge datasets are compressed and fed into a model that generates recommendations. Matrix Factorization (MF) and clustering are examples of traditional methods. In the recent past, deep-learning-based methods have also gained popularity [11, 12]. Algorithms under deep learning can be of the following types:

    (a) Multi-layered Perceptron (MLP): a feed-forward NN architecture with hidden layers, where back-propagation is used for learning.

    (b) Convolutional NN (CNN) [12, 13]: incorporates convolution (a mathematically defined linear operation) on data projected onto a grid.

    (c) Recurrent Neural Networks (RNN) [12, 14]: these NNs are well suited to modelling sequential data, as they can remember former computations.

    (d) Restricted Boltzmann Machines (RBM) [12, 15]: generative stochastic NNs with two layers that learn a probability distribution over their input.

    (e) Autoencoders (AE) [16]: an unsupervised NN model that attempts to reconstruct its input data at the output layer.

A lot of survey literature has been published on deep-learning-based recommendation systems that use CNN, RNN, or RBM as the architecture, whereas this paper surveys the literature and approaches that use CF for recommender systems with autoencoders as the underlying architecture.

1.1 Research Challenges and Future Direction

This study establishes that benchmark datasets are available for research in the area of recommender systems; however, it is strongly recommended that more datasets pertaining to other fields of recommendation be generated so that recommender systems for those fields can also be developed. One striking observation from this survey is that the available literature lacks a comparative analysis in which all the models are compared experimentally. Such an experimental comparison would project better insights and allow application-specific choices of model to be suggested.

The rest of the paper is organized as follows: Sect. 2 discusses AEs and their types. Section 3 presents the literature available on the subject of the paper, the metrics used for evaluating the approaches, and the datasets used for comparing approaches in the area. Section 4 compares the various works reported in the literature for recommendation systems using CF and AEs. Section 5 concludes the study with major insights and open research issues and challenges.

2 Autoencoders

In general, autoencoders consist of four main parts [16]: an encoder, a bottleneck, a decoder, and a reconstruction loss. The encoder compresses the input into an encoded representation of lower dimension. The bottleneck layer serves as a salient feature representation of the input data, representing it in the lowest possible dimensions. The decoder reconstructs the encoded data into a representation that is as close to the input as possible. The reconstruction loss measures the loss incurred during this encode-decode process, and training an AE means minimizing this loss using back-propagation (Fig. 4).
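As a concrete illustration of these four parts, the following minimal sketch (not taken from any of the surveyed papers) wires an encoder, a bottleneck, a decoder and a reconstruction loss together in PyTorch; the layer sizes, activation functions and optimizer are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_inputs: int, n_bottleneck: int = 32):
        super().__init__()
        # Encoder: compresses the input into the bottleneck representation
        self.encoder = nn.Sequential(nn.Linear(n_inputs, 128), nn.ReLU(),
                                     nn.Linear(128, n_bottleneck))
        # Decoder: reconstructs the input from the bottleneck
        self.decoder = nn.Sequential(nn.Linear(n_bottleneck, 128), nn.ReLU(),
                                     nn.Linear(128, n_inputs))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder(n_inputs=1000)
loss_fn = nn.MSELoss()                       # reconstruction loss
opt = torch.optim.Adam(model.parameters())   # minimised via back-propagation

x = torch.rand(64, 1000)                     # a dummy batch of inputs
opt.zero_grad()
loss = loss_fn(model(x), x)                  # how far the reconstruction is from the input
loss.backward()
opt.step()
```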

Fig. 4 Types of autoencoders

A vanilla AE is a simple two-layered NN with one hidden layer, where the input and output layers are of the same size and the hidden layer is smaller than both. A Denoising AE [17] introduces noise into the input and then reconstructs this corrupted data to produce a clean output. Adversarial AEs consist of a discriminator and a generator: the discriminator estimates the probability of a point x belonging to the data distribution, while the generator generates data that is fed to the discriminator with the intention of fooling it [18]. A Contractive AE [19] adds a regularizer to the objective function of the AE; this regularizer is the Frobenius norm of the Jacobian matrix of the encoder activations with respect to the input. A Variational AE maps the input to a distribution, where the bottleneck consists of a mean vector and a standard deviation vector.
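The corruption step of a Denoising AE can be illustrated with a small, hypothetical helper; the masking probability and the Gaussian noise level below are arbitrary assumptions, and the reconstruction target remains the clean input.

```python
import torch

def corrupt(x, mask_prob=0.2, gaussian_std=0.1):
    # Masking noise: randomly zero out a fraction of the input entries
    mask = (torch.rand_like(x) > mask_prob).float()
    noisy = x * mask
    # Additive Gaussian noise (optional second corruption)
    return noisy + gaussian_std * torch.randn_like(x)

# Training pair for a denoising AE: reconstruct the clean x from corrupt(x), e.g.
#   loss = loss_fn(model(corrupt(x)), x)
```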

3 Literature Survey

The recent past has seen a shift away from matrix factorization and towards deep learning approaches to CF for recommendation systems. Of the numerous architectures available in the literature, autoencoders are a natural fit for the CF problem. This section discusses and critically reviews various models proposed in the literature that use AEs as the underlying architecture, presents the factors used to evaluate these models, and discusses the prominent datasets used in recommender system research.

One such integration of AEs into CF was introduced as AutoRec [20], a state-of-the-art model for recommendation. It has two variants: I-AutoRec (item-based) and U-AutoRec (user-based). It takes as input partial vectors of either items or users and projects them to a lower-dimensional latent space in the hidden layer. The reconstructions projected to the output layer are the recommendation results. It uses identity and sigmoid activation functions. Comparing RMSE values, the authors conclude that I-AutoRec outperforms U-AutoRec on the MovieLens 1M and 10M datasets.
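The item-based variant can be sketched roughly as follows, using only the description above (sigmoid hidden activation, identity output, loss over observed ratings only); the hidden size and the omission of regularisation are simplifying assumptions rather than the paper's settings.

```python
import torch
import torch.nn as nn

class AutoRec(nn.Module):
    """I-AutoRec-style sketch: reconstruct an item's partial rating vector over all users."""
    def __init__(self, n_users: int, n_hidden: int = 500):
        super().__init__()
        self.encode = nn.Linear(n_users, n_hidden)   # projects ratings to the latent space
        self.decode = nn.Linear(n_hidden, n_users)   # projects reconstructions to the output

    def forward(self, r):
        # Sigmoid on the hidden layer, identity on the output layer
        return self.decode(torch.sigmoid(self.encode(r)))

def masked_loss(pred, r, observed):
    # Only the observed entries of the partial rating vector contribute to the loss
    return ((pred - r) ** 2 * observed).sum()
```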

Denoising Autoencoders (DAE) [17] introduce the concept of adding noise to the input. Strub and Mary [21] introduced an AE model that computes a non-linear matrix factorization from sparse rating inputs. It has two versions: Uencoder and Vencoder. Missing values are converted to zero in the input and back-propagated layers; this is known as masking noise and helps make recommendations. The input is a sparse representation of users and items and the output is dense, hence addressing sparsity issues. The authors report that V-AutoRec outperforms the Uencoder and Vencoder.

Marginalized Stacked DAE (mSDA) [22] uses linear denoisers and marginalizes out the random feature corruption. It uses Stochastic Gradient Descent (SGD). This architecture addresses the issues of scalability, high-dimensional features, and high computation cost. The Deep CF (DCF) model intermixes MF methods with marginalized DAEs, with a time complexity of O(tN), where t is the number of iterations and N is the number of ratings. mSDA-CF, a variant of mSDA, stacks marginalized DAEs and updates the latent features at the layer whose index is the average of the total number of layers and 1. The authors claim that this significantly improves time complexity.

Collaborative Denoising AE (CDAE) [23] generalizes several other state-of-the-art models, including the Latent Factor Model [24, 25] and the Factorized Similarity Model. CDAE has I + 1 nodes in the input layer, where I is the number of items and the extra node is a user-specific node. The hidden layer has a bias node, and the output layer has I nodes. The layers are fully connected, and CDAE uses SGD to learn its parameters.
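The structure just described can be sketched as follows, with the user-specific input node realised as a per-user embedding added to the hidden layer; the hidden size and the sigmoid activations are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CDAE(nn.Module):
    def __init__(self, n_users: int, n_items: int, n_hidden: int = 200):
        super().__init__()
        self.user_node = nn.Embedding(n_users, n_hidden)  # the user-specific (I+1)-th input node
        self.encode = nn.Linear(n_items, n_hidden)         # weights W and hidden bias
        self.decode = nn.Linear(n_hidden, n_items)         # weights W' and output bias

    def forward(self, corrupted_y, user_id):
        # Fully connected layers; corrupted item-interaction vector plus the user node
        z = torch.sigmoid(self.encode(corrupted_y) + self.user_node(user_id))
        return torch.sigmoid(self.decode(z))
```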

A user-based CF model built on the Stacked DAE (SDAE) was developed in [26]. It uses Gaussian noise to corrupt the data and uses the output of the hidden layers for the training process. Built upon it is the CF-based Stacked Denoising AE model (CF-SDA), which calculates the difference between the latent similarity obtained from the aforementioned model and the surface similarity. It employs sigmoid and identity mapping activation functions, Mean Squared Error as the loss function, and the Adam optimizer.

Recommendation via Dual AEs (ReDa) [27] is a representational learning framework that learns latent features from user-item data and aims to minimize deviations in the training data. It uses stacked AEs to approach a globally optimal solution. ReDa uses the sigmoid function during encoding and decoding. The algorithm requires the calculation of partial derivatives of the variables used in its optimization equations. The model uses gradient descent for optimization until convergence, which is not guaranteed every time, as the proposed optimization problem is not convex in nature.

Collaborative Adversarial AE (CAAE) [28] is a framework based on Generative Adversarial Networks (GAN) [18]. A GAN has a discriminator NN and a generator NN: the discriminator estimates the probability of a point belonging to the data distribution, and the generator generates data that is fed into the discriminator in order to fool it. The CAAE discriminator uses L2 regularization and is implemented with Bayesian Personalized Ranking (BPR) for learning relative preferences over items, and the framework has positive- and negative-item generator AEs. SGD is used for the learning process.
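The BPR part mentioned above can be sketched as a pairwise loss that pushes the score of an observed (positive) item above that of a sampled negative item for the same user; how the scores are produced, and the L2 weight, are assumptions here.

```python
import torch

def bpr_loss(pos_scores, neg_scores, params, l2=1e-4):
    # Maximise log sigma(score_pos - score_neg), i.e. prefer positives over sampled negatives
    rank_loss = -torch.log(torch.sigmoid(pos_scores - neg_scores)).mean()
    # L2 regularisation over the model parameters
    reg = l2 * sum((p ** 2).sum() for p in params)
    return rank_loss + reg
```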

Fine-Grained Collaborative AE (FG-ACAE) [29] is an adversarial framework for non-linear model optimization, in which the model and the adversarial noise iteratively minimize and maximize the loss function, respectively. The framework is applied to CDAE [23], where a noise-mixing layer replaces the denoising layer. The identity activation function is used in the mixing layer, cross-entropy is used as the loss function, and mini-batch gradient descent is used for optimization.

Mult-VAE^PR [30] extends the Variational AE (VAE) [31, 32] to CF with implicit feedback, incorporating a non-linear probabilistic model. The model samples K-dimensional latent representations from a Gaussian prior and applies a non-linear function \(f_{\theta}\) to them to produce a probability distribution over the items. The softmax function is used for output normalization. The model uses variational inference and minimizes the Kullback-Leibler (KL) divergence. Using a multinomial likelihood for the data distribution and adjusting the over-regularized VAE objective gives a significant improvement over state-of-the-art baselines. Bayesian inference is used for parameter estimation.
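The objective can be sketched as a multinomial log-likelihood obtained from a softmax over the item logits, plus a down-weighted KL term against the standard Gaussian prior; the beta value below is an illustrative assumption, not the paper's tuned setting.

```python
import torch
import torch.nn.functional as F

def mult_vae_loss(logits, x, mu, logvar, beta=0.2):
    # Multinomial log-likelihood of the observed interaction vector x
    neg_ll = -(x * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
    # KL divergence between q(z|x) = N(mu, exp(logvar)) and the standard Gaussian prior
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()
    return neg_ll + beta * kl   # beta < 1 corresponds to the adjusted, less regularized objective
```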

Sequential Variational AE (SVAE) [33] conditions each event on the previous event, thus modelling temporal dependencies. The authors argue that there is a recurrent relationship between the current data point and the previous one.

Queryable Variational AE (Q-VAE) [34] models the joint probability of the user's preferences and the conditional probability of other similar users' preferences. It models the log joint probability of an arbitrarily partitioned subset of the user's partial preferences. The authors claim this is different from the existing VAE-based models. It uses KL divergence for regularizing the posterior distributions.

RecVAE [35] is built upon Mult-VAE^PR. It changes the encoder of Mult-VAE^PR to a denoising encoder, and the decoder uses softmax activation. It uses a convex combination of a standard Gaussian prior and a regularization term in the form of KL divergence. Training alternates between the encoder and the decoder. Monte Carlo sampling is used to estimate the log-likelihood and the KL divergence, and cross-entropy is the main component of the loss function. The authors claim it outperforms other methods.

AutoSVD++ [36] is a hybrid CF model that combines Contractive Auto-Encoders (CAE) [19] with SVD++ [37]. Atop this, a new hybrid model is built that takes implicit feedback as input. The model takes high-level feature representations from the CAE, which are integrated into the AutoSVD++ model. The model minimizes a regularized squared error loss and uses SGD to learn its parameters. The authors claim that their model outperforms mSDA-CF and U-AutoRec, and that it is scalable.

Semi Auto-Encoder (SA-HCF) [38] proposes an architecture where the output layer is shorter than the input layer, which makes incorporating side information easier. Based on it, the authors design a hybrid CF model and claim it can solve both the rating prediction and ranking prediction problems. For ranking prediction, the user's partially observed vector and profile vector are used, and for rating prediction the item's partially observed vectors and explicit ratings are used.

The hybrid collaborative filtering (HCF) model with a semi-stacked denoising AE (Semi-SDAE), referred to as HCF-SS [39], takes three inputs: the original data, the noise-corrupted data, and the side information. Semi-SDAE can incorporate extended side information as input without changing the dimension of the output layer. The model uses two layers of Semi-SDAE, which are incorporated into MF, and the parameters are learnt using SGD. The authors demonstrate that the HCF-SS model outperforms AutoSVD++, ReDa and SA-HCF.

AutoCOT [40] is based on cooperative training (COT): it implements AutoRec and combines the COT model with it. The mediator model generates a mixture distribution of real and generated user-browsed data and aims to optimize the KL divergence, while the generator model optimizes the JS divergence between them. Both the mediator and the generator have a single hidden layer, and training occurs iteratively on the two. The authors claim AutoCOT can alleviate sparsity issues in large datasets.

Deep Heterogeneous AEs (DHA) [41] utilize information from multiple domains to improve accuracy. The model uses SDAE and RNN to extract latent features from non-sequential and sequential data respectively, with two Long Short Term Memory networks for the sequential data, and it incorporates two DHAs. It uses coordinate descent for optimization and SGD for learning the weight matrices and bias vectors. The authors' experiments demonstrate competitive model performance.

Built upon U-AutoRec is an AE-based CF (ACF) framework that provides top-N recommendations. It employs two AEs: PR-Net and IL-Net [42]. PR-Net employs pairwise regularization and samples negative items, while IL-Net captures the items that are more likely to be true negatives. PR-Net is fed explicit feedback and IL-Net is fed implicit feedback, and SGD is used to learn the models. The authors claim that this framework can also be extended to one-class CF by altering the inputs.

EASE^R [43] is a linear model with zero hidden layers. It takes a user-item interaction matrix X (generally binary) as input, and the diagonal of its item-item weight matrix is constrained to zero. The model uses a squared loss between X and the predicted scores, and the resulting constrained convex optimization problem has a closed-form solution. The authors state the computational complexity of the algorithm as O(|I|^3), where |I| is the number of items, and claim superior accuracy for the model.
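Because the solution is closed-form, it can be sketched directly in numpy; the regularization weight lam below is an arbitrary assumption, and the zero-diagonal constraint is enforced explicitly.

```python
import numpy as np

def ease(X: np.ndarray, lam: float = 500.0) -> np.ndarray:
    """Closed-form EASE^R item-item weights for a binary user-item matrix X."""
    G = X.T @ X + lam * np.eye(X.shape[1])   # regularised item-item Gram matrix
    P = np.linalg.inv(G)                     # the O(|I|^3) inversion dominates the cost
    B = P / (-np.diag(P))                    # B_ij = -P_ij / P_jj
    np.fill_diagonal(B, 0.0)                 # constrain the diagonal to zero
    return B

# Predicted scores for ranking unseen items: scores = X @ B
```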

The AE algorithm for CF (AE-CF) [44] combines an AE with clustering. The scoring data is combined with the user data to construct a scoring matrix from which user features are extracted; the AE is used here for dimensionality reduction. These features are clustered using k-means, and recommendations are made to the user from within the resulting clusters, which narrows the search space. It uses Pearson's correlation coefficient.
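The idea can be illustrated with a hypothetical end-to-end sketch: a reduced user-feature matrix (random data standing in for the AE bottleneck here) is clustered with k-means, and candidate items for a target user are scored only within that user's cluster; the cluster count and the mean-rating scoring rule are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
user_features = rng.random((500, 32))           # stand-in for AE bottleneck features
ratings = rng.integers(0, 6, size=(500, 100))   # stand-in user-item scoring matrix

clusters = KMeans(n_clusters=10, n_init=10).fit_predict(user_features)

target = 0
same_cluster = np.where(clusters == clusters[target])[0]
scores = ratings[same_cluster].mean(axis=0)     # score items within the target's cluster only
top_items = np.argsort(-scores)[:10]            # narrowed search space for recommendations
```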

3.1 Metrics Used for Evaluation of Approaches

Models surveyed in this paper use Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Precision and Recall, Mean Average Precision (MAP), Normalized Discounted Cumulative Gain (NDCG), Mean Reciprocal Rank (MRR) as evaluation metrics. These are very well-known metrics and have been thoroughly studied in the literature.
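For reference, a few of these metrics can be computed with short, self-contained functions; the snippets below are plain numpy sketches rather than the evaluation code of any surveyed paper.

```python
import numpy as np

def rmse(pred, true):
    return np.sqrt(np.mean((np.asarray(pred) - np.asarray(true)) ** 2))

def mae(pred, true):
    return np.mean(np.abs(np.asarray(pred) - np.asarray(true)))

def precision_at_k(ranked_items, relevant, k):
    # Fraction of the top-k recommended items that are relevant
    return len(set(ranked_items[:k]) & set(relevant)) / k

def ndcg_at_k(ranked_items, relevant, k):
    # Binary-relevance NDCG: discounted gain of hits, normalised by the ideal ordering
    dcg = sum(1.0 / np.log2(i + 2) for i, item in enumerate(ranked_items[:k]) if item in relevant)
    idcg = sum(1.0 / np.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / idcg if idcg > 0 else 0.0
```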

3.2 Datasets

Table 2 highlights the major datasets used in the literature for research related to recommendation systems.

Table 2 Datasets used in the models

4 Comparison of Various Approaches Surveyed in the Study

This section compares the various approaches reported in the literature over the last decade for making recommendations using CF with AEs as the architecture. The approaches are reviewed based on the type of AE used, the category of information taken for applying CF, and the datasets used. Table 3 also summarizes the methodology adopted, the results reported, and their comparison with other contemporary approaches in light of the evaluation metrics (Table 4).

Table 3 Comparison of various approaches surveyed in the study
Table 4 Acronyms used in the table and their full forms

5 Contribution of the Survey

In this paper, multiple models based on the AE architecture for collaborative filtering are surveyed. The survey forms an informative basis for decision making when selecting a model for implementation. It also highlights the limitations of these models, which can inspire new models that overcome those limitations. Similarly, the strengths of the surveyed models can inspire new models that build upon one or more of them, thereby combining the positives of multiple models.

6 Conclusion and Future Work

In this paper, a thorough survey has been conducted on AE-based models that employ CF methods for recommender systems. Through the survey, it is evident that research in the field of recommender systems is flourishing and has great scope for application. The literature suggests that, as the number of layers in the AEs is increased from two to four, they perform slightly better, and the resulting setup behaves more robustly when the number and type of hyperparameters are changed. The survey has also found that HCF-SS, EASE^R and AE-CF perform well in terms of evaluation metrics such as RMSE, MAE, Precision and Recall, and NDCG when used with MovieLens as a dataset. Future work might implement a combination of two or more models, for example EASE^R with AE-CF, to study their synergetic effect on the recommendations made.