1 Introduction

The expanding demand for personalization has paved the way for Recommender Systems (RSs), which counter the prevailing information overload by predicting users’ intentions. The efficiency of such a system is generally determined by how effectively diverse users’ interests are identified [1, 17], how accurately users’ transactions on items are represented [17, 32], and how well significant user and item features are extracted [17]. Collaborative filtering (CF) is a fast-evolving technique [18, 36] that relies on feedback scores to infer similar users and items. The Netflix Prize competition, which began in 2006, spurred several advances in this area and gave the research community access to large movie rating datasets, drawing the attention of many researchers and scientists [4]. However, precisely comprehending varied users’ preferences demands complex models that require massive amounts of data, which runs counter to the sparsity limitation. The latter refers to the situation where the available data on users’ interactions with items is sparse [17], making it challenging for the RS to infer accurate predictions.

The continuous pursuit of better prediction accuracy has led to the use of Deep Learning (DL), a subfield of Machine Learning (ML), to build robust and powerful RSs. The current interest in deep models in the recommendation field is due to novel methods that have shown practical training procedures, thus outperforming shallow techniques, especially under sparse conditions [17, 23, 24, 44]. On the other hand, kernel methods have also gained keen interest in RSs, owing to their meaningful feature extraction and efficient discovery of structure in data. Recent kernel-based approaches leverage the kernel trick to explore relevant structures in non-linear empirical data in different contexts, thereby capturing reliable features for better prediction [7, 30] or classification [31, 44], provided the kernel function is selected appropriately.

Several state-of-the-art papers integrate side information into DL-based models as an inevitable choice to propose hybrid solutions for missing rating prediction [22, 24]. The additional information can be item properties [24], textual reviews [43], or simply demographics gathered from users’ profiles [37]. Nevertheless, accurate personalization is not about names, ages, or places, as such attributes do not track users’ volatile tastes and behaviors [17]; moreover, such data is hard to collect due to privacy constraints. On the other hand, to generate Top-k recommendations, consisting of ranked and personalized lists of items for users, some papers employ auxiliary information in Matrix Factorization (MF) techniques as regularizers to learn latent features [33, 48]. However, the resulting feature vectors represent user and item interactions linearly, which is notably inadequate for heterogeneous real-world data, especially when the user-item matrix is sparse and additional information is unavailable [17].

To address these problems, we introduce, in this paper, the Kernelized Finetuned Deep Belief Network (KFDBN) model to accurately predict missing ratings and generate efficient Top-k recommendations even when only a few ratings of users on items are available. We first learn a deep generative model using a Deep Belief Network (DBN) for missing rating prediction. The model captures higher-level hidden feature representations of users and items by allowing reliable layer-by-layer learning of weights, resulting in better generalization. The deep structure is built by stacking multiple Restricted Boltzmann Machines (RBMs) to learn the visible layer’s distribution probabilistically, allowing each successive RBM to capture meaningful features in the sparse rating matrix. We use the Contrastive Divergence (CD) algorithm [13] to initialize the deep architecture of the generative model in an unsupervised way. Such initialization avoids the typical problems of gradient-based approaches, which are time-consuming, have to deal with many layers and parameters throughout the learning process, and might get trapped in local minima [23]. Next, the compressed data representation and the restricted parameter space establish a starting point for supervised fine-tuning. This optimization ensures a further effective generative reconstruction of user-item interactions, yielding accurate predictions and personalized recommendations.

We also present the efficient KFDBN classifier that emphasizes extracting powerful hidden features of users and items, unlike classification-based CF approaches, which merely employ shallow feature extraction. We demonstrate that KFDBN can be successfully applied to enhance classification performance for different datasets by using the deep generative approach as a base pre-model that produces, along with Kernel Principal Component Analysis (KPCA) [35], meaningful latent feature vectors to conduct an efficient two-step kernel-based discriminative semi-supervised classification.

Another crucial contribution of this paper is a novel imputation technique that addresses sparsity issues in MF-based recommendation. We use KFDBN as a hybrid imputation technique to infer missing entries in the original sparse rating matrix. The positive predictions emanating from the generative model are merged with the ratings of relevant items resulting from the KFDBN classifier component to create a denser user-item rating matrix for the Top-k recommendation. This solves the optimization issues that occur when MF-based methods are applied directly to sparse data, since the error function considers only the available inputs. On the other hand, the proposed unified model discovers the underlying local and global non-linear correlations in sparse data to infer missing user-item entries, which significantly differs from state-of-the-art MF-based recommendation approaches [11, 34] that treat unobserved ratings as negative inputs to compensate for the skewed distribution of known interactions.

The contributions of this paper are summarized as follows:

  • We propose a novel hybrid model named Kernelized Finetuned Deep Belief Network (KFDBN) that combines the practical training of the DBN with kernel methods’ relevant exploration of data structures to overcome sparsity issues in RSs. We first present the effective Fine-tuned DBN (FDBN) model for accurate missing rating prediction.

  • We show that going deeper helps to capture reliable features by comparing the performance of the RBM with that of the DBN.

  • We demonstrate that supervised fine-tuning of the generative model yields better prediction results.

  • We present a practical hybrid approach to seamlessly incorporate features extracted by the generative model into kernel-based feature extraction using KPCA to create user profiles. The unified KFDBN feature extraction helps to capture important information among items by discovering correlation features among co-rated items from users’ perspectives, thus identifying precise and relevant items that satisfy users’ preferences. The proposed user profile expansion mechanism significantly improves the performance of sparse recommendation.

  • We prove that output features extracted using the hybrid KFDBN lead to higher Support Vector Classification (SVC) outcomes, demonstrating that KFDBN can be successfully extended to effective semi-supervised classification.

  • We develop a novel imputation technique that infers missing ratings in the original sparse user-item interaction matrix to solve sparsity issues in MF-based recommendations.

  • Finally, extensive experimentation on six sparse datasets in diverse domains shows that the FDBN generative model significantly enhances the missing rating prediction and outperforms the state-of-the-art methods by 21.32% and 16.35% in terms of RMSE and MAE, respectively, averaging on all used datasets. We also empirically prove that the KFDBN leads to higher performance for the Top-k recommendation and yields 5.2% and 8.1% improvement in HR@10 and NDCG@10, respectively, averaging on the six datasets.

The remainder of this paper is divided into five sections. Section 2 provides an overview of related work on DL-based feature learning and kernel-based methods applied in RSs. The foundations and preliminaries of RBMs, DBNs, and kernel methods are presented in Section 3. Section 4 is dedicated to thoroughly describing the proposed KFDBN and discussing its different components. Section 5 reports experimental results, answers the research questions addressed in this paper, and discusses the findings. Finally, the paper’s conclusions and future directions of the work are presented in Section 6.

2 Related work

The related work of this paper covers two chief points: DL-based models for enhanced feature extraction and kernel-based techniques applied in RSs.

2.1 Enhanced-feature extraction based on deep networks

Autoencoders (AEs) are widely employed in RSs for their appropriate latent feature representation. Bathla et al. [3] use a shared layer in an AE to incorporate indirect social trust with ratings and learn accurately from linked representations. Hybrid AE-based models include users’ side information to address data scarcity [46]. However, users’ latent information is typically diverse and multimodal and may contain aspects that are irrelevant for prediction. As a result, emphasizing such features in the AE causes overfitting. Moreover, the reconstruction outputs of AE-based approaches are sometimes directly leveraged as predictions of unknown rating values [23], which may impair the effectiveness of the recommendation.

CapsMF [22] leverages Bi-directional Recurrent Neural Networks (RNNs) with Capsule Networks for textual analysis of reviews to generate recommendations, thus enhancing the capabilities of Convolutional Neural Networks (CNNs) and RNN-based models in text modeling and representation. On the other hand, Capsule Networks require careful tuning since they are time-consuming to train, especially for large-scale recommendation.

Generative Adversarial Networks (GANs) are employed to learn user and item representations [9, 29] and to perform rating augmentation [6] to alleviate sparsity in CF. However, GAN-based models tend to predict high-value ratings for most missing entries in the user-item matrix [6]. Hence, approaches that adopt such a biased augmented matrix for generating recommendations are likely to perform poorly, as they fail to capture users’ actual interests. Furthermore, the discriminator of such models sometimes produces a high percentage of contradicting labels for the same items due to early and misleading convergence [29]. Additionally, inverting a fully trained GAN-based system is a tedious task.

2.2 Kernel-based methods for recommendation

Though linear methods are simpler to interpret and have shown promising prediction results [38], they may not be suitable for training on non-linearly separable and heterogeneous real-world data [17], which has promoted the application of kernel methods in RSs. Chen et al. [7] propose kernel-based matrix completion for neighborhood-based recommendation. The approximation of missing ratings is enhanced by integrating the Gaussian kernel function with the similarity measure. The technique combines various characteristics in non-linear latent feature spaces by employing a multi-kernel framework, adjusting the kernels’ weights to generate relevant recommendations.

The Kernel Context Recommender System (KCR) employs the kernel trick to build a model based on a context rating matrix for prediction using contextual information [21]. The research shows that varying the kernel function leads to higher performance than standard context-aware RSs. However, Principal Component Analysis (PCA) could be further applied to filter relevant context features and enhance the efficiency of the kernel mapping framework. Besides, a DL-based model could learn diverse representations from varied contexts, thus improving the quality of the recommendations.

To transcend the idea of incorporating mapped features of similar users directly into classifiers to generate recommendations, [31] used Multiple Kernel Learning for feature separation and then fed the output to the Adaptive Neuro-Fuzzy Inference System (ANFIS) for classification. The proposed model has shown relevant sensitivity and specificity. However, it has not been evaluated under sparse conditions, and it applies solely to heart disease recommendation. Another paper [44] combined the Scale-Invariant Feature Transform (SIFT) with the Faster R-CNN algorithm and Support Vector Machine (SVM) classification for music recommendation based on a DL and IoT architecture. First, SIFT is used to extract local features based on the Gaussian convolution kernel. Then, Faster R-CNN is utilized to extract the underlying multi-scale features of scene images. Finally, middle-layer characteristics with spatial information are acquired using SVM classification to develop the background music system. Otherwise, Singular Value Decomposition (SVD) may be leveraged to infer low-dimensional vectors of users’ data [30]; a kernel function can then discover relevant latent features between users before predicting ratings using neural models such as the Multilayer Perceptron (MLP) [30].

3 Background preliminaries

3.1 Restricted Boltzmann Machine

A Restricted Boltzmann Machine is a stochastic generative Neural Network (NN) that learns a probability distribution over its set of inputs [13]. The weighted bipartite graph of the RBM, \(\theta =(W_{ij},v_{bias_{i}},h_{bias_{j}})\) is composed of two layers (the visible layer V and the hidden layer H) connected through a set of weights Wij, and biases (\(v_{bias_{i}}\), \(h_{bias_{j}}\)). RBMs can be used for regression, classification, dimensionality reduction, CF, and so forth [19]. Depending on the task, they can be trained in a supervised or unsupervised way. A Restricted Boltzmann Machine is illustrated by Fig. 1.

Fig. 1
figure 1

Restricted Boltzmann Machine

The binary units of the RBM are connected to form a joint configuration (v,h) whose overall energy is defined by (1)

$$ E(v,h;\theta)= -{\sum\limits_{i}^{n}}{\sum\limits_{j}^{m}}W_{ij}v_{i}h_{j}-\sum\limits_{i}v_{bias_{i}}v_{i}-\sum\limits_{j}h_{bias_{j}}h_{j} $$
(1)

In a Boltzmann machine (BM), the only restrictions are that all connections are symmetrical, and no unit has a connection to itself. However, due to their high complexity, such networks are much less utilized compared to RBMs, which are BMs in which there are no visible-visible, and hidden-hidden connections [13]. The joint probability distribution is then calculated using (2).

$$ p(v,h;\theta)=\frac{e^{-E(v,h;\theta)}}{\eta(\theta)} $$
(2)

Where \(\eta (\theta )={\sum }_{v}{\sum }_{h}e^{-E(v,h;\theta )}\) defines the appropriate normalization factor. The probability that a model assigns states of hidden units to the vector h is computed by marginalizing out the visible vector v as presented in (3).

$$ p(h;\theta)=\sum\limits_{v}p(v,h;\theta)=\frac{1}{\eta(\theta)}\sum\limits_{v} e^{-E(v,h;\theta)} $$
(3)

To get a sample generated by the network according to the probability p(v), the Gibbs sampling, which is a Markov chain Monte Carlo (MCMC) technique, is used [13]. Yet, to obtain samples that are faithful to the learned distribution, it is theoretically necessary to iterate between the states of the Markov chain until convergence.

Figure 2 depicts the sampling process of hidden units \(h^{(t)}\leftarrow p(h=1\mid v^{(t)})\) followed by visible units sampling \(v^{(t+1)} \leftarrow p(v = 1 \mid h^{(t)})\).

Fig. 2
figure 2

Gibbs sampling iterative process
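
To make this iterative process concrete, the following minimal NumPy sketch (an illustration only; the dimensions, helper names, and random initialization are not taken from the paper) performs one Gibbs iteration: hidden units are sampled from p(h = 1 ∣ v) and visible units are then resampled from p(v = 1 ∣ h).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def gibbs_step(v, W, v_bias, h_bias):
    """One Gibbs iteration: sample h ~ p(h | v), then v ~ p(v | h)."""
    p_h = sigmoid(v @ W + h_bias)                        # p(h_j = 1 | v)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(h @ W.T + v_bias)                      # p(v_i = 1 | h)
    v_new = (rng.random(p_v.shape) < p_v).astype(float)
    return h, v_new

# Toy RBM with 6 visible and 4 hidden binary units
n, m = 6, 4
W = 0.01 * rng.standard_normal((n, m))
v_bias, h_bias = np.zeros(n), np.zeros(m)
v = rng.integers(0, 2, size=n).astype(float)
h, v_next = gibbs_step(v, W, v_bias, h_bias)
```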

Training the RBM is thus performed by maximizing the log probability, which consists simply in minimizing the Kullback-Leibler divergence [13] between two distributions: the target distribution \(\langle v_{i}h_{j} \rangle^{0}\) of the training data, namely the input, and the distribution \( \langle v_{i}h_{j} \rangle ^{\infty }\) modeled by the RBM. The gradient of the log probability with respect to a specific weight is, therefore, computed according to (4).

$$ \frac{\partial log(p(v))}{\partial W_{ij}} = \frac{\partial log(\frac{1}{\eta(\theta)}{\sum}_{h} e^{-E(v,h;\theta)})}{\partial W_{ij}} $$
$$ \frac{\partial log(p(v))}{\partial W_{ij}} = \langle v_{i}h_{j} \rangle^{0}-\langle v_{i}h_{j} \rangle^{\infty} $$
(4)

3.2 Deep belief network

The Deep Belief Network is a probabilistic model mainly composed of stacked RBMs. The training of the DBN is performed through a greedy layer-by-layer learning algorithm [14]. A first unsupervised pre-training phase is carried out successively in each layer before finally applying supervised optimization to the initialized parameters resulting from the first stage. When trained on a set of unlabeled examples, the stacked RBMs act as feature detectors that probabilistically reconstruct the input data. The chief purpose of this unsupervised step is to match the inputs and outputs of successive hidden layers. This reconstruction criterion thus ensures that most of the information is preserved after training all the layers. The model parameters at layer Z, along with the conditional probabilities of the hidden units, are used as the generated data needed to train the model’s parameters at layer Z + 1. The model can be further optimized with supervision to perform fine-tuning.

3.3 Kernel-based methods

The use of kernel techniques is driven by the possibility of projecting data values to another space of a higher dimension where the linear separation is achievable [7]. The performance of such approaches crucially depends on the appropriate choice of the kernel function. The latter is a symmetric, continuous, and positive function corresponding to a scalar product in a higher dimension space. Formally, for each kernel K, there exists a function \( \phi : \mathcal {X} \subset \mathbb {R}^{n} \rightarrow {\mathscr{H}}\) such that:

$$ K(x,x^{\prime}) = \langle\phi(x),\phi(x^{\prime})\rangle_{\mathcal{H}} \quad \forall x,x^{\prime} \in \mathcal{X} $$
(5)

with \(\mathcal {X}\) the original space and \({\mathscr{H}}\) the Hilbert space of dimension greater than \(\mathcal {X}\). The kernel functions examined in this work are presented in Table 1. Along with the cost C, each kernel has a specific parameter to optimize (i.e., γ, β, or P), allowing efficient performance and making it possible to benefit from simple, rigorously grounded techniques while dealing with non-linear problems.

Table 1 Kernel functions

In contrast to the linear kernel, which produces linear boundaries in a higher dimension, the polynomial kernel expands m input features to \(m^{P}\) features with polynomial degree P > 0 by performing feature conjunctions, which are valuable when the input data is normalized [16]. This kernel also optimizes β, a real weighting parameter that trades off the impact of lower-order polynomials. The Radial Basis Function (RBF) is a Gaussian kernel independent of the nature of the observed data, which makes it more suitable when there is no prior information about the training data [45]. Here, γ controls the variance of the distribution; hence finding the optimal value of γ yields better kernel performance. Unlike other statistical similarity metrics, the cosine kernel measures similarity as the normalized dot product of two vectors in a high-dimensional space by abstracting out their magnitude, which is beneficial when the size of the vectors influences the similarity computation, primarily when the Euclidean distance is employed [42]. With a sigmoid kernel, the learning algorithm behaves similarly to a two-layer NN, being less prone to local optima and generalizing well. A typical value for the slope γ is \(\frac {1}{M}\), where M is the dimension of the input data [28].
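
As a rough illustration of these kernels (Table 1 is not reproduced here, so the exact parameterizations below follow common conventions and the roles of γ, β, and P described above; they are an assumption, not the paper’s definitions):

```python
import numpy as np

def linear_kernel(x, y):
    return x @ y

def polynomial_kernel(x, y, gamma=1.0, beta=1.0, P=3):
    # degree-P polynomial with weighting parameter beta
    return (gamma * (x @ y) + beta) ** P

def rbf_kernel(x, y, gamma=0.5):
    # Gaussian kernel; gamma controls the width of the distribution
    return np.exp(-gamma * np.sum((x - y) ** 2))

def cosine_kernel(x, y):
    # normalized dot product, independent of the vectors' magnitudes
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def sigmoid_kernel(x, y, gamma=None, beta=0.0):
    gamma = gamma if gamma is not None else 1.0 / x.size   # slope gamma = 1/M
    return np.tanh(gamma * (x @ y) + beta)

x, y = np.array([1.0, 0.5, -0.2]), np.array([0.3, -0.1, 0.8])
print(rbf_kernel(x, y), cosine_kernel(x, y))
```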

Thanks to their theoretical foundations, SVMs provide a unique approach that is less prone to overfitting and advantageous compared to an Artificial Neural Network (ANN), which may not allow robust classification of different samples [10]. Kernel methods have been particularly effective for classification based on SVMs, which search for a maximum-margin hyperplane that separates the closest data points belonging to different classes while lying at the maximum distance from the observations [5]. The mapping of points to a higher-dimensional space and the appropriate choice of the kernel function’s optimization parameters not only decrease the computational cost, mainly when dealing with extensive data, but also enhance the generalization performance.

Kernel methods are also leveraged in KPCA [35], which is a non-linear extension of PCA. The latter determines new independent variables, called Principal Components (PCs), thus finding relevant projection spaces for the data by maximizing their projected variance. In other words, the information extracted using KPCA is non-linearly related to the input data. In fact, by employing kernel functions, KPCA projects the observations into a new space of higher dimension and then carries out an ordinary PCA. Compared to other non-linear extensions of PCA, for instance NNs, KPCA is known for its stability and reduced computational cost [45].
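
A minimal scikit-learn sketch contrasting KPCA with linear PCA is given below (the data, component count, and RBF width are illustrative stand-ins, not values used in our experiments):

```python
import numpy as np
from sklearn.decomposition import KernelPCA, PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))     # stand-in for extracted feature vectors

# Non-linear projection through an RBF kernel, then ordinary PCA in that space
kpca = KernelPCA(n_components=5, kernel="rbf", gamma=0.1)
Z_kpca = kpca.fit_transform(X)

# Linear PCA for comparison
pca = PCA(n_components=5)
Z_pca = pca.fit_transform(X)

print(Z_kpca.shape, Z_pca.shape)       # (200, 5) (200, 5)
```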

4 Proposed model

This section presents the proposed Kernelized Finetuned Deep Belief Network (KFDBN) model, which comprises three major components:

  • KFDBN feature extraction:

    • Fine-tuned DBN (FDBN) for missing rating prediction: We first develop a deep generative model of stacked RBMs by leveraging unsupervised pre-training as an initializer of the deep architecture instead of randomly initializing the model’s parameters. This unsupervised pre-training phase has a regularization effect that renders the learning process more efficient by establishing a restricted parameter space for further optimization. We then incorporate semi-supervised learning to improve the model’s generalization performance by reaching a lower minimum of the cost function. Once all RBMs have been trained, the initialization phase is complete. The compressed representation of the data is used as input to a logistic layer that is trained with backpropagation (BP) to perform discriminative fine-tuning. Hence, the FDBN allows the extraction of higher-level hidden features of users and items and leads to substantially better prediction results in the reconstructed rating matrix.

    • Kernel-based feature extraction: This stage applies KPCA to further capture the relevant hidden structure of the categorical FDBN output features. The processed latent factors are used to expand Users’ and Items’ Profiles to discover correlation features among co-rated items from users’ perspectives. As a result, each item is characterized by a number of hidden characteristics of the users who assessed the item. Then, PCA converts the learnt factors into a new set of related unified KFDBN features, retaining only the relevant patterns and meaningful information.

  • KFDBN item relevancy classification: In this phase, we investigate a hybrid approach that incorporates the extracted KFDBN features and the available feature data as input vectors of a kernel-based discriminative classifier, aiming to efficiently group items into three classes reflecting item utility: relevant, neutral, and irrelevant.

  • KFDBN-SVD Top-k recommendation: The driving idea behind this phase is to enhance the ranking performance of MF-based recommendation approaches. We use the KFDBN results to infer missing ratings in the original sparse user-item interaction matrix. The positive predictions emanating from the FDBN generative model are merged with the ratings of relevant items resulting from the KFDBN classifier component to create a denser imputed matrix for SVD-based Top-k recommendation.

The proposed KFDBN architecture is depicted in Fig. 3. The following sub-sections examine in depth each component and its influence on the entire proposed model.

Fig. 3
figure 3

The architecture of the proposed KFDBN model

4.1 KFDBN feature extraction

4.1.1 FDBN for missing rating prediction

Unsupervised pre-training

In this phase, we aim for an accurate generative reconstruction of the sparse input data, which is achievable as long as the model’s parameters are initialized effectively. Therefore, to overcome the typical problems of gradient descent-based approaches, which can quickly get stuck in local minima [23] and do not allow the efficient training of hidden layers of large dimensions [13], we join an additional layer of feature detectors to an RBM to learn an effective deep generative model for missing rating prediction. Here, layer-wise unsupervised training is performed on stacked RBMs to capture higher-level features of the input data. To expedite the sampling stage depicted in Fig. 2, Contrastive Divergence is applied [13]. The primary purpose is to maximize the log-likelihood, i.e., log(p(v)), of the data to learn the weight matrix W = {Wij}, which guarantees an efficient reconstruction of the sparse input user-item matrix. The weights of the network can be updated according to (6).

$$ {\triangle W_{ij} = \alpha (\langle v_{i}h_{j} \rangle^{0} - \langle v_{i}h_{j} \rangle^{\infty})} $$
(6)

Where \(\langle v_{i}h_{j} \rangle^{0}\) is the expectation under the training distribution, \(\langle v_{i}h_{j} \rangle ^{\infty }\) denotes the expectation under the reconstructed sparse data distribution provided by the RBM, and α is the learning rate. Since the units of the same layer have no connection, the states in a layer only depend on the states of the units in the other layer; hence \(\langle \cdot \rangle^{0}\) can be easily computed. On the other hand, \(\langle \cdot \rangle ^{\infty }\) can be obtained by applying the CD algorithm, which incorporates two significant points:

  • Instead of starting at a random visible point, i.e., v(0), CD uses an input example from training data to initialize the Gibbs procedure, i.e., \(v^{(0)}\leftarrow v \in \widehat {T}\), which makes it possible to explore the regions near the training examples to modify the surface of the energy function [13].

  • A regularization effect is ensured by performing only GS iterations of Gibbs sampling [2].

The algorithm of the Contrastive Divergence with GS steps is described in Algorithm 1. Equation (7) is used to calculate the hidden binary units, and then the visible states are computed using (8). Output visible units are approximate reconstructions of training data.

$$ p(h_{j}=1\mid v) = \sigma\left( {\sum\limits_{i}^{n}}w_{ij}v_{i} + h_{bias_{j}}\right) $$
(7)

Where \(\sigma (s) = \frac {1}{1 + e^{-s}}\) is the sigmoid logistic function.

$$ p(v_{i}=1\mid h) = \sigma\left( {\sum\limits_{j}^{m}}w_{ij}h_{j} + v_{bias_{i}}\right) $$
(8)
Algorithm 1
figure a

Contrastive Divergence(\(\widehat {T}\), α, W, vbias, hbias, mo, GS).

By comparing the computed probability with the predefined threshold κ according to (9), hj is set to 0 if p(hj = 1∣v) is lower than κ and 1 otherwise. The same principle is applied when it comes to activating or inhibiting the states of visible units.

$$ x_{k} = \begin{cases}0 & p(x_{k}=1\mid y) < \kappa \\1 \ & p(x_{k}=1\mid y)\geq\kappa \end{cases} $$
(9)

Where κ ranges from 0.5 to 1.

When training the RBM with the Contrastive Divergence algorithm, the model’s parameters are updated in each training epoch, and the GS steps of Gibbs sampling are conducted. We divide the training data \(\widehat {T}\) into mini-batch samples with a batch size of BS data points. Updating the summations only needs to be done once in each batch; hence, a batch of input vectors still requires O(nm) complexity, where n and m are the input vector and hidden vector dimensions, respectively. Moreover, the CD algorithm avoids the exponential complexity of summing over all of the values of the visible or hidden units by sampling the model distribution with the probability distribution of the input vector. The calculation of the conditional probabilities has a computational complexity of O(nm). Therefore, the overall complexity of the Contrastive Divergence with GS Gibbs sampling steps is O(BS × GS × n × m).

The first RBM of the unsupervised DBN model is trained using Contrastive Divergence with the input sample v(0) of training data presented on its visible layer, with categorical five-star ratings regarded as five nodes, one for each potential evaluation value. Once trained, for each sample v of the training set, the first RBM associates a configuration h of neurons from the hidden layer. h is constructed by sampling the distributions \(p(h_{j}\mid v^{(t)})\). The configurations h thus obtained are compressed versions of v. Each of the following RBMs learns the hidden representations of the previous one. Here, the hidden layer of the first RBM acts as a visible layer for the next RBM. Once all RBMs have been trained in an unsupervised fashion, the compressed version of the input data in the hidden layer of the last RBM, along with the model’s inner parameter space, defines an initialization point for a supervised fine-tuning algorithm, thus enabling semi-supervised strategies to achieve optimal generalization results. The unsupervised pre-training stage of the DBN is detailed in Algorithm 2.

Algorithm 2
figure b

Unsupervised pre-training(\(\widehat {T}\), Nls, αpr, W, hbias, mo, GS).
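
For illustration, the sketch below implements one CD update (Eqs. (6)–(8)) and the greedy layer-wise stacking of Algorithms 1 and 2 in NumPy. It is a minimal toy version only: layer sizes, learning rate, and epochs are illustrative, the momentum mo and the threshold κ of Eq. (9) are omitted, and the categorical five-node encoding of ratings is replaced by generic binary inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def cd_update(v0, W, v_bias, h_bias, alpha, gs):
    """One Contrastive Divergence update on a mini-batch v0 of shape (BS, n)."""
    p_h0 = sigmoid(v0 @ W + h_bias)                    # Eq. (7): p(h = 1 | v)
    pos = v0.T @ p_h0                                  # <v h>^0, positive phase

    h = (rng.random(p_h0.shape) < p_h0).astype(float)
    for _ in range(gs):                                # GS steps of Gibbs sampling
        p_v = sigmoid(h @ W.T + v_bias)                # Eq. (8): p(v = 1 | h)
        v = (rng.random(p_v.shape) < p_v).astype(float)
        p_h = sigmoid(v @ W + h_bias)
        h = (rng.random(p_h.shape) < p_h).astype(float)
    neg = v.T @ p_h                                    # <v h>^GS, negative phase

    bs = v0.shape[0]
    W += alpha * (pos - neg) / bs                      # Eq. (6)
    v_bias += alpha * (v0 - v).mean(axis=0)
    h_bias += alpha * (p_h0 - p_h).mean(axis=0)
    return W, v_bias, h_bias

# Greedy layer-wise pre-training: each RBM is trained on the hidden
# activations of the previous one (toy layer sizes and random binary data).
layer_sizes = [500, 200, 100]
data = rng.integers(0, 2, size=(256, layer_sizes[0])).astype(float)

weights, hidden_biases = [], []
inputs = data
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    W = 0.01 * rng.standard_normal((n_in, n_out))
    vb, hb = np.zeros(n_in), np.zeros(n_out)
    for epoch in range(5):
        for start in range(0, len(inputs), 32):        # batch size BS = 32
            batch = inputs[start:start + 32]
            W, vb, hb = cd_update(batch, W, vb, hb, alpha=0.0005, gs=1)
    weights.append(W)
    hidden_biases.append(hb)
    inputs = sigmoid(inputs @ W + hb)   # compressed representation feeds the next RBM
```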

Supervised fine-tuning

In the supervised fine-tuning phase, we back-propagate the error to calculate gradients of the loss function to obtain optimal weights and biases that will further enhance the performance of the generative rating prediction. The proposed model is considered fine-tuned if it can find the desired target T associated with the corresponding input data V. This is achieved by minimizing the quadratic cost function MSE defined by (10).

$$ \ell =\frac{1}{2N}\sum\limits_{V}\parallel T-H^{(V,N_{ls})}\parallel^{2} $$
(10)

Where Nls is the number of layers, N is the number of training examples, and H is the vector of output activation states from the NN when using V as input.

The aim of the fine-tuning stage, described in Algorithm 3, is to investigate how changes in the weights and biases affect the reconstruction loss function by computing the gradients (\(\frac {\partial \ell }{\partial W}\), \(\frac {\partial \ell }{\partial h_{bias}}\)). Thence, an intermediate output error \(out^{N_{ls}}\) in the last layer is computed by taking into account the rate of change of ℓ with respect to the output activations, \(\nabla _{H}\ell = \frac {\partial \ell }{\partial H^{N_{ls}}}\), and \(\sigma ^{\prime }\), the derivative of the activation function. Given that \(D^{(Z)} \leftarrow W^{Z} H^{Z-1} + h_{bias}^{Z}\), \(\sigma ^{\prime }\) indicates how fast the activation function is changing at \(D^{N_{ls}}\). \(out^{N_{ls}}\) is measured using (11). In the case of the MSE function, \(\nabla _{H}\ell = H^{N_{ls}} - T\).

$$ out^{N_{ls}} = \frac{\partial \ell}{\partial H^{N_{ls}}} \sigma^{\prime}(D^{N_{ls}}) $$
(11)

The output error for any layer can be measured by merging (11) with (12). The latter is based on the error of the next layer, which involves a backward propagation of the error through the NN.

$$ out^{Z} = (W^{(Z+1)})^{T}out^{(Z+1)} \cdot \sigma^{\prime}(D^{Z}) $$
(12)

To examine the impact of the weights and biases of any layer Z on ℓ, the gradients \(\frac {\partial \ell }{\partial W^{Z}}\) and \(\frac {\partial \ell }{\partial h_{bias}^{Z}} \) are calculated using (13) and (14), respectively.

$$ \frac{\partial \ell}{\partial W^{Z}} = H^{Z-1} out^{Z} $$
(13)
$$ \frac{\partial \ell}{\partial h_{bias}^{Z}} = out^{Z} $$
(14)

The gradients are further updated using different batches until the training examples are entirely fine-tuned. The weights and biases resulting from Algorithm 3 are then employed to optimize the FDBN model for better prediction.

Algorithm 3
figure c

Supervised fine-tuning(\(\widehat {T},L,N_{ls},\alpha _{fn},\ell \)).
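
A minimal NumPy sketch of one such backpropagation step over the stacked sigmoid layers, following Eqs. (11)–(14), is shown below (the batch, layer sizes, and learning rate are illustrative; mini-batching over \(\widehat{T}\) and the full Algorithm 3 loop are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def finetune_step(V, T, weights, biases, alpha=0.1):
    """One BP step over a batch (V, T) for a stack of sigmoid layers."""
    # Forward pass, caching activations H^(Z); H^(0) is the input V
    H, Hs = V, [V]
    for W, b in zip(weights, biases):
        H = sigmoid(H @ W + b)            # D^(Z) = W^Z H^(Z-1) + h_bias^Z
        Hs.append(H)

    N = V.shape[0]
    out = (Hs[-1] - T) * Hs[-1] * (1 - Hs[-1])            # Eq. (11), MSE + sigmoid'
    for z in reversed(range(len(weights))):
        grad_W = Hs[z].T @ out / N                        # Eq. (13)
        grad_b = out.mean(axis=0)                         # Eq. (14)
        if z > 0:                                         # Eq. (12): propagate backwards
            out = (out @ weights[z].T) * Hs[z] * (1 - Hs[z])
        weights[z] -= alpha * grad_W
        biases[z] -= alpha * grad_b
    return weights, biases

# Toy fine-tuning step: 32 samples, layers 50 -> 40 -> 20
V = rng.random((32, 50))
T = rng.random((32, 20))
weights = [0.01 * rng.standard_normal((50, 40)), 0.01 * rng.standard_normal((40, 20))]
biases = [np.zeros(40), np.zeros(20)]
weights, biases = finetune_step(V, T, weights, biases)
```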

4.1.2 Kernel-based feature extraction

In this stage, we seamlessly incorporate the hidden features learned by the generative FDBN model into kernel-based feature extraction using KPCA. Since the visible units of the FDBN model are not simple nodes with a single entry but are instead modeled as five nodes representing individual evaluations (1–5 rating scale), we accordingly convert each set of five binary latent feature nodes of the last hidden layer of the FDBN into categorical feature data. Then, KPCA maps these non-linear FDBN features \( \{x_{1},x_{2},\ldots,x_{X}\} \in \mathbb {R}^{M}\) to a high-dimensional space by transforming them into linearly separable factors F. The mapping projection function ϕ extends the data points into a feature space \(\mathbb {F}\) of higher dimension as follows:

$$ \phi : \mathbb{R}^{M} \rightarrow \mathbb{F}, x \rightarrow \phi(x) $$
(15)

Where {ϕ(x1), ϕ(x2), …, ϕ(xX)} are assumed to be centered at the origin of \(\mathbb {F}\), i.e., \({\sum }_{k=1}^{X} \phi (x_{k}) = 0\).

Given \(\overline {C}\), the covariance matrix in the resulting space \(\mathbb {F}\), KPCA satisfies the equation \( \nu \overline {V} = \overline {C} \overline {V} \) for positive eigenvalues ν and eigenvectors \(\overline {V} \in \mathbb {F}\setminus \{0\}\). \(\overline {C}\) is described by (16).

$$ {\overline{C}=\frac{1}{X}\sum\limits_{j=1}^{X} \phi(x_{j}) \phi(x_{j})^{T}} $$
(16)

One may note that all \(\overline {V}\) with ν≠ 0 lie in the subspace generated by {ϕ(x1), ϕ(x2), …, ϕ(xX)}. We may thence consider the following equivalent equations for all i = 1, …, X:

$$ {\phi(x_{i})\overline{C}\overline{V}=\nu(\phi(x_{i})\overline{V})} $$
(17)

and that there exist eigenvector coefficients eu (u = 1, …, X) such that:

$$ {\overline{V}=\sum\limits_{u=1}^{X} e_{u} \phi(x_{u})} $$
(18)

Given a kernel function K, to get eu, we define the X × X kernel matrix \(\overline {K}\) according to (19)

$$ {\overline{K}_{uj} := \langle\phi(x_{u}),\phi(x_{j})\rangle= (K(x_{u},x_{j}))_{uj}} $$
(19)

Then, substituting \(\overline {C}\), and \( \overline {V}\) into (17), we arrive at:

$$ {X\nu \overline{K}e= \overline{K}^{2}e} $$
(20)

Which is equivalent to solving \(X\nu e= \overline {K}e\) for nonzero ν [35].

It is noteworthy that, by employing the kernel functions \(K(x,x^{\prime })\) defined in Table 1, the dot products of the projected vectors ϕ(x) are computed without explicitly carrying out the map ϕ(⋅). Besides, the data points cannot be directly centered in \(\mathbb {F}\). Hence, the final definition of the kernel matrix of centered data can be described using the matrix \(\overline {K}\) as follows:

$$ { \widetilde{\overline{K}_{uj}} = \overline{K}-1_{X}\overline{K}-\overline{K}1_{X}+1_{X}\overline{K}1_{X}} $$
(21)

Where the matrix \((1_{X})_{uj} = \frac {1}{X}\); further details can be found in [35].

To extract the Principal Components (PCs) corresponding to a kernel function K, we first compute the matrix \(\overline {K}\), then calculate and normalize \(\overline {V}\) in the feature space \(\mathbb {F}\). The projection of a data point x onto \(\overline {V}^{i}\) in \(\mathbb {F}\) can be expressed by:

$$ {(KPC)_{i} (x)= \overline{V}^{i}\phi(x)} $$
(22)

Therefore, for the test point x, we can describe the ith feature representation value as follows:

$$ {F'_{i} = \overline{V}^{i}\phi(x) = \sum\limits_{u=1}^{X} {e_{u}^{i}}K(x,x_{u})} $$
(23)

As shown in Fig. 4, we use the output \(F^{\prime }\) features to create Users’ Profiles (UPs), which in turn generate novel Items’ Profiles (IPs). Here, we suppose that all users who have rated a particular item have common characteristics. Hence, IPs represent hidden features hc in items that may interest the users. In other words, an item ik is described by the merged UPs of the users who rated the item, which is beneficial in cold-start situations where items and users are new to the system and no initial rating or feedback is available [20]. Therefore, dimensionality reduction using PCA is applied to the new UP-based item profiles to boost computational efficiency and capture important information among items relying on UPs. PCA constructs the transformation matrix M that converts D = {hc1, hc2, …, hcD}, the original combined hidden features of the IPs in a q-dimensional space, into new reduced KFDBN features in a t-dimensional space, where t ≪ q. The Eigen Decomposition (ED) decomposes the Covariance Matrix \(\overline {C} (D.D^{T} )\) into three matrices as follows:

$$ D.D^{T} \rightarrow S.L.S^{T} $$
(24)

Where S is the matrix (q × q) that contains Eigenvectors \(\overline {V}\), which represent directions; and L is the diagonal matrix (q × q) with eigenvalues ν that represent data magnitude.

Fig. 4
figure 4

Kernel-based feature extraction

Algorithm 4
figure d

PCA(D,q,t).

One primary benefit of employing PCA on the UP-based item profiles lies in further discovering correlation features among co-rated items from users’ perspectives, thus capturing precise and relevant items that satisfy users’ needs. The impact of the resulting KFDBN features on item relevancy classification is examined in depth in the following sections.
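
A toy sketch of this profile-expansion step is given below. The aggregation of user profiles into an item profile (here, a simple mean over the profiles of the users who rated the item) is an assumption made for illustration, as are the dimensions and the random stand-in data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_users, n_items, n_up_feats = 100, 50, 20

UP = rng.standard_normal((n_users, n_up_feats))   # stand-in for KPCA-based user profiles
rated = rng.random((n_users, n_items)) < 0.05     # boolean mask of observed ratings

# Item profile: aggregate of the profiles of the users who rated the item
# (mean aggregation assumed); items with no raters keep a zero profile.
IP = np.zeros((n_items, n_up_feats))
for j in range(n_items):
    raters = np.where(rated[:, j])[0]
    if raters.size > 0:
        IP[j] = UP[raters].mean(axis=0)

# PCA keeps only the t most informative directions (t << q)
pca = PCA(n_components=5)
reduced_items = pca.fit_transform(IP)
print(reduced_items.shape)    # (50, 5)
```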

4.2 KFDBN item relevancy classification

To perform item relevancy classification, we emphasize powerful deep feature engineering by incorporating the hidden latent factors extracted by KFDBN into the classification task, unlike traditional classifiers that apply shallow techniques in the learning process [40, 44]. The purpose is to take advantage of the practical training procedure of the FDBN generative approach by using it as a base pre-model that produces input features, along with KPCA, to improve the classification results when facing sparsity issues. The KFDBN model is leveraged to create input vectors for kernel-based Support Vector Classification (SVC) in two prominent cases:

  • KPCA is applied to the features as a pre-processing stage to capture further meaningful hidden structures and enhance the item relevancy classification. Let \( \{x_{1},x_{2},\ldots,x_{X}\} \in \mathbb {R}^{M}\) be the training data with X observations of features resulting from the KFDBN model combined with other feature data related to ratings and items in a given space \(\mathbb {R}\) of dimension M. KPCA first maps X to a high-dimensional space using the equations defined in Section 4.1.2. The projection of the training data is then used as input to the SVC.

  • KFDBN output features are combined with other features and then directly incorporated to perform the kernel-based semi-supervised classification approach.

Let \(\{(y_{p},R_{p})\}_{p=1}^{Y}\) be the training set with input examples \(y_{p} \in Y \subset \mathbb {R}^{N}\) and corresponding labels \(R_{p} \in \{1,...,N_{c}\}_{p=1}^{Y}\) of N-dimensional feature vectors yp. Let Nc be the number of possible classes; we use the one-versus-one (OvO) approach, also referred to as pairwise classification [25], to introduce \(\frac {N_{c}(N_{c}-1)}{2}\) classifiers, one for each pair of classes. The following binary optimization problem is solved to train the classifier for the class pair separating the class of index r from that of index \(r^{\prime }\).

$$ \begin{array}{@{}rcl@{}} && min_{w^{rr^{\prime}},b^{rr^{\prime}},\xi^{rr^{\prime}}} \frac{1}{2} \parallel w^{rr^{\prime}} \parallel^{2} + C \sum\limits_{l} \xi_{l}^{rr^{\prime}} \end{array} $$
(25a)
$$ \begin{array}{@{}rcl@{}} &\text{subject to } &(w^{rr^{\prime}})^T\phi(y_l)+b^{rr^{\prime}}\geq1-\xi_l^{rr^{\prime}}, \text{ if } R_l=r \end{array} $$
(25b)
$$ \begin{array}{@{}rcl@{}} && (w^{rr^{\prime}})^T\phi(y_l)+b^{rr^{\prime}}\leq-1+\xi_l^{rr^{\prime}}, \text{ if } R_l=r^{\prime}\ \end{array} $$
(25c)
$$ \begin{array}{@{}rcl@{}} && \xi_l^{rr^{\prime}} \geq 0 \end{array} $$
(25d)

Where \(w^{rr^{\prime }}\) is a vector of dimension L (L ≥ N), C is the penalty parameter, ξl are the slack variables, and b denotes the bias term. The decision function for classes r and \(r^{\prime }\) is defined according to (26).

$$ \mathscr{F}_{r,r^{\prime}}(y)= (w^{rr^{\prime}})^{T}\phi(y)+b^{rr^{\prime}} $$
(26)

Here, ϕ(y) are mapped vectors in a feature space of dimension L and \({\mathscr{F}}_{r,r^{\prime }}(y)= - {\mathscr{F}}_{r^{\prime },r}(y)\). The voting strategy [15] is used to classify a new sample. Specifically, for the input instance y, we compute

$$ \mathscr{V}_{r}(y)= \sum\limits_{r^{\prime}\neq r, r^{\prime}=1}^{N_{c}} sign(\mathscr{F}_{rr^{\prime}}(y)) $$
(27)

Where

$$ sign(t)=\begin{cases}1, & \text{if } t> 0\\0, & \text{otherwise}\end{cases} $$
(28)

Namely, the binary classifier gives 1 as its vote to \({\mathscr{V}}_{r}(y)\) when the instance y is allocated to class r rather than class \(r^{\prime }\). Otherwise, it votes 0. \({\mathscr{V}}_{r}(y)\) is hence the sum of all the votes obtained from the classifiers determining that y is assigned to class r.

$$ r^{*}(y)= \operatorname*{arg max}_{r=1,...,N_{c}}\mathscr{V}_{r}(y) $$
(29)

Therefore, as described by (29), the sample y is classified into the relevancy class (relevant, neutral, and irrelevant) that receives the highest number of votes from the classifiers.

The employed multi-class support vector classification using the OvO technique is beneficial for kernel methods since it solves two-class binary problems instead of using the entire training data Nc times. The proposed classification approach thus allows robust and efficient classification of both small and large data. Moreover, the KFDBN classifier leverages a quadratic optimization that does not suffer from local optima and overcomes the overfitting problem by employing regularization principles, which naturally expedites the learning process.
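
For reference, scikit-learn’s SVC trains the \(\frac{N_{c}(N_{c}-1)}{2}\) one-vs-one binary classifiers internally, so a minimal version of this classification stage can be sketched as follows (the feature matrix, labels, kernel, and C are illustrative placeholders; in practice they come from the KFDBN features and a grid search):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((600, 30))       # stand-in for KFDBN feature vectors
y = rng.integers(0, 3, size=600)         # 0: irrelevant, 1: neutral, 2: relevant

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# SVC builds Nc(Nc-1)/2 binary classifiers (one-vs-one) and aggregates their votes
clf = SVC(kernel="rbf", C=1.0, gamma="scale", decision_function_shape="ovo")
clf.fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))
```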

4.3 KFDBN-SVD Top-k recommendation

Now that the KFDBN model has generated missing rating predictions using its FDBN generative model and has classified the relevant items using the classifier component, we exploit the KFDBN results to propose a new imputation technique for MF-based recommendation approaches. Specifically, we merge the FDBN’s high output predictions with the ratings of relevant items to create a denser user-item interaction matrix. We then apply SVD to the imputed rating matrix to generate Top-k recommendation lists of ranked items for the users.

The regularized SVD decomposes the rating matrix \(R \in \mathbb {R}^{n \times m}\) into two low-rank matrices, such that users are associated with feature vectors ai of \(A \in \mathbb {R}^{n \times p}\), and items are described by their latent features bj of \(B \in \mathbb {R}^{p \times m}\), where the rank p << min(n,m). Hence, the matrix R can be estimated using:

$$ R \approx A^{T} B $$
(30)

The rating record rij is factorized into user and item latent feature vectors by minimizing the following squared loss function:

$$ min_{A,B}\frac{1}{2}\sum\limits_{i=1}^{n}\sum\limits_{j=1}^{m}I_{ij}(r_{ij}-{a_{i}^{T}}b_{j})^{2}+\frac{\lambda_{A}}{2}\mid\mid A{\mid\mid_{F}^{2}}+\frac{\lambda_{B}}{2}\mid\mid B{\mid\mid_{F}^{2}} $$
(31)

Since most inputs in R are missing due to the sparsity, the indicator function Iij takes 1 if the user i has evaluated the item j and 0 otherwise. The regularization terms (\(\lambda _{A} = \frac {{\sigma _{A}^{2}}}{\sigma ^{2}},\lambda _{B} = \frac {{\sigma _{B}^{2}}}{\sigma ^{2}}\)) are incorporated into the objective function to make a trade-off between overfitting and the variance σ in approximating available entries. The optimum minimization with respect to the Frobenius norm ∣∣.∣∣F can be obtained by employing gradient descent in A and B.

Therefore, the low-rank linear approximation of the KFDBN-based imputed rating matrix ImpR can be estimated by solving the following optimization:

$$ min_{A,B} \triangle_{\omega} +\triangle_{\psi} +\frac{\lambda_{A}}{2} \mid\mid A{\mid\mid_{F}^{2}}+\frac{\lambda_{B}}{2} \mid\mid B{\mid\mid_{F}^{2}} $$
(32)

Where \(\triangle _{\omega } = \frac {1}{2}{\sum }_{i=1}^{n}{\sum }_{j=1}^{m}I_{\omega }(i,j)(r_{ij}-{a^{T}_{i}}b_{j})^{2} \) and \(\triangle _{\psi } = \frac {1}{2}{\sum }_{i=1}^{n}{\sum }_{j=1}^{m}I_{\psi }(i,j)(r_{ij}^{*}-{a^{T}_{i}}b_{j})^{2} \). Iω(i,j) is the indicator function that takes 1 if the entry rij is available and 0 otherwise. Iψ(i,j) is the observed indicator, which equals 1 if the imputed rating \(r_{ij}^{*}\) exists and 0 otherwise. ω and ψ are, respectively, the sets of known entries and KFDBN-based imputed ratings. Combining △ω and △ψ, the aim is to minimize the following equation:

$$ min_{A,B} \frac{1}{2}\sum\limits_{i=1}^{n}\sum\limits_{j=1}^{m}I(i,j)(\overbrace{r_{ij}}-{a^{T}_{i}}b_{j})^{2}+\frac{\lambda_{A}}{2} \mid\mid A{\mid\mid_{F}^{2}}+\frac{\lambda_{B}}{2} \mid\mid B{\mid\mid_{F}^{2}} $$
(33)

Where \(\overbrace {r_{ij}}=\begin {cases}r_{ij} & (i,j) \in \omega \\r_{ij}^{*} & (i,j) \in \psi \end {cases}\). Updating latent matrices of ImpR for users and items is performed as follows:

$$ \begin{array}{@{}rcl@{}} & a_i= a_i + \alpha_A(\iota_{ij}b_j-\lambda_Aa_i) & \end{array} $$
(34a)
$$ \begin{array}{@{}rcl@{}} & b_j= b_j + \alpha_B(\iota_{ij}a_i-\lambda_Bb_j) \end{array} $$
(34b)

Where αA, αB are the learning rates and \(\iota _{ij} = \overbrace {r_{ij}} - {a_{i}^{T}}b_{j}\) is the prediction error.
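
A minimal sketch of these stochastic updates over the union of observed and imputed entries is shown below (the latent rank, learning rates, regularization strengths, and the randomly generated triples are illustrative placeholders for \(\overbrace{r_{ij}}\)):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 50, 40, 8
A = 0.1 * rng.standard_normal((n, p))    # user latent factors a_i
B = 0.1 * rng.standard_normal((m, p))    # item latent factors b_j

# (i, j, r) triples standing in for the union of observed ratings (omega)
# and KFDBN-imputed ratings (psi)
triples = [(i, j, float(rng.integers(1, 6)))
           for i in range(n) for j in range(m) if rng.random() < 0.1]

alpha_A = alpha_B = 0.01
lam_A = lam_B = 0.05

for epoch in range(20):
    for i, j, r in triples:
        err = r - A[i] @ B[j]                            # prediction error iota_ij
        a_old = A[i].copy()
        A[i] += alpha_A * (err * B[j] - lam_A * A[i])    # Eq. (34a)
        B[j] += alpha_B * (err * a_old - lam_B * B[j])   # Eq. (34b)
```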

Finally, based on the extracted correlation matrices, we generate the Top-k item recommendations for users. The proposed KFDBN-based imputation approach is particularly useful in solving the sparsity hurdle in SVD while leveraging its advantages from complexity reduction to performance improvement [30].

5 Experiments and evaluation

To fully evaluate the effectiveness of the proposed KFDBN model, we perform extensive experiments on two main recommendation tasks: missing rating prediction and Top-k ranking recommendation. We first present each task’s experimental setup, including evaluation metrics, baseline approaches, and parameter settings. Next, we analyze and discuss the experimental findings, aiming to answer the following Research Questions (RQs):

  1. RQ1. What is the performance of the RBM model in rating prediction?

  2. RQ2. Does going deeper using the RBM lead to better accuracy?

  3. RQ3. Does the FDBN outperform the state-of-the-art approaches in missing rating prediction?

  4. RQ4. Do KFDBN features yield better accuracy in the KFDBN classifier?

  5. RQ5. What is the impact of kernel Dimensionality Reduction on the KFDBN classifier performance?

  6. RQ6. Does the KFDBN-based imputation technique outperform the state-of-the-art approaches in the Top-k recommendation?

The experiments are conducted using six real datasets from various domains and characterized by different sparsity levels. Table 2 summarizes the statistics of each dataset. One may note that the experiments are performed both on large-scale and small datasets to examine the effectiveness of the proposed approach for different data sizes. The MovieLens datasets, collected by GroupLens over several periods, describe rating transactions from the MovieLens website. FilmTrust was crawled from a website where users provide feedback ratings on recommended movies. Amazon Digital Music, Automotive, and Instant Video reflect users’ interests in items of several categories and industries on the Amazon platform. The sparsity reported in Table 2 is computed as \(1-\frac {N_{R}}{N_{U} \times N_{I}}\), where NR, NU, and NI are the numbers of ratings, users, and items, respectively.

Table 2 Description of evaluation datasets

We investigate the sensitivity of the chief parameters and settings of the proposed model and benchmark approaches for each recommendation task. Table 3 presents these hyper-parameters, their definitions, and their impact on the performance of solutions.

Table 3 Hyper-parameters of the proposed KFDBN and benchmark methods

5.1 Evaluation for missing rating prediction

5.1.1 Experiment settings

Evaluation metrics

We use two widely employed metrics, Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE), to calculate the average error in the rating predictions. MAE and RMSE indicate, as presented in (35) and (36), the extent to which the proposed model can generate relevant predictions close to the ratings provided by users.

$$ MAE = \frac{1}{\mid{N\mid}}{\sum}_{n_{k}\in N}\mid\widehat{R}(n_{k}) - {R}(n_{k})\mid $$
(35)

Due to the linearity of MAE, the average variations are weighted equally, whereas the RMSE measure emphasizes the most substantial deviations by squaring the errors before averaging them.

$$ RMSE =\scriptstyle \sqrt{\frac{1}{\mid{N\mid}}{\sum}_{n_{k}\in N}(\widehat{R}(n_{k}) - {R}(n_{k}))^{2}} $$
(36)

Where N, \(\widehat {R}(n_{k})\), and R(nk) are respectively the number of assessed items, the predicted rating and the ground truth rating.
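
As a quick reference, both metrics can be computed in a few lines (toy predicted and ground-truth vectors are used here for illustration):

```python
import numpy as np

def mae(pred, truth):
    return np.mean(np.abs(pred - truth))

def rmse(pred, truth):
    return np.sqrt(np.mean((pred - truth) ** 2))

pred = np.array([3.8, 2.1, 4.5, 1.2])
truth = np.array([4.0, 2.0, 5.0, 1.0])
print(mae(pred, truth), rmse(pred, truth))   # MAE weights errors equally; RMSE squares them
```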

Baselines

To assess the performance of the proposed generative model in missing rating prediction, we compare it with the following state-the-art methods on different recommendation domains:

  • Non-Negative Matrix factorization (NMF) [12]: MF method with a non-negative probabilistic constraint to select similar user groups and explain CF recommendations.

  • Neural Collaborative Filtering (NCF) [11]: A state-of-the-art NN-based approach that models hidden user-item interaction using a standard MLP.

  • SVD++ [26]: SVD extension that integrates implicit records into the recommendation model to uncover latent factors that characterize dense ratings.

  • A hybrid Deep Learning-based Recommender System (DNNREC) [24]: DNN-based approach that learns non-linear latent features using embedding and auxiliary information about users and items.

  • Joint Representation Learning with Ratings and Reviews (JRLRR) [43]: A deep hybrid method that fuses rating embedding and textual item features based on Gated Recurrent Unit and attention mechanisms for prediction.

  • Multi-Objective Evolutionary Algorithm (MOEA) [32]: A multi-objective method to find the appropriate set of similar neighbors for an active user in CF recommendation.

Parameter settings

To train the RBM, we use a momentum mo of 0.1 for the six datasets. We set α to 0.0001 for Digital Music (DG), Instant Video (AIV), and Automotive (Auto), to 0.0005 for both MovieLens 100K (ml-100k) and MovieLens 1M (ml-1m), and to 0.09 for the FilmTrust dataset. A batch size BS of 32 performed well on all datasets. Several experiments are set up to examine the validity of the hyper-parameters and select their optimal values for the FDBN model. Based on the validation outcomes, D is set to 2000 latent features for the first hidden layer and 1000 latent features for the second hidden layer, while the number of propagated training samples BS is set to 32 on all datasets. α is set to 0.9 on all datasets except AIV, where α = 0.2 shows better performance. With unsupervised pre-training, mo is set to 0.1 on all datasets to prevent the model from getting stuck in local minima. For the benchmark models, the parameters are determined from the published ones and by a grid search, which optimizes the approaches to determine the hyper-parameters that result in the most precise recommendation. The optimal values for each method according to recommendation domains and datasets are reported in Tables 4 and 5.

Table 4 Hyper-parameters of benchmark models in movie recommendation domain: Rating Prediction task
Table 5 Hyper-parameters of benchmark models in music, video, and automotive recommendation domains: Rating Prediction task

5.1.2 The performance of the RBM model (RQ1)

Figures 5 and 6 show the sensitivity of the dimension of hidden features on the accuracy of the RBM model. The RMSE results of the RBM model are presented in Fig. 5, while the MAE results are reported in Fig. 6. The best accuracy on all datasets is achieved when setting D to 5000. By increasing the dimension of latent features, RBM leads to better performance, which indicates that the model can extract more reliable information.

Fig. 5
figure 5

RMSE Results of the RBM model (the lowest is better) according to the dimension of latent features

Fig. 6
figure 6

MAE Results of the RBM model (the lowest is better) according to the dimension of latent features

Figure 7 depicts the training and testing times of the RBM on all datasets. It has been observed that using a smaller batch size leads to better accuracy but slows down the convergence of the learning process. Likewise, a lower learning rate allows the model to find a more optimal set of weights but takes longer to train.

Fig. 7
figure 7

Time complexity of the RBM model in both training and testing stages in seconds based on the dimension of latent features

5.1.3 The performance of the DBN model (RQ2)

To illustrate the performance of the DBN model before and after fine-tuning, Table 6 reports the RMSE and MAE results on the six datasets. We can observe that the RBM model outperforms the DBN before fine-tuning on four out of the six datasets (ml-1m, FilmTrust, AIV, and Auto), while the DBN before fine-tuning outperforms the RBM on ml-100k and DG. Nevertheless, after fine-tuning, the DBN model yields more accurate predictive results.

Table 6 Comparison of prediction accuracy of the RBM, DBN, and FDBN (fine-tuned DBN)

The best RMSE and MAE values on all datasets are achieved using the FDBN, which indicates the significance of the supervised fine-tuning phase. The fine-tuning procedure is conducted through BP to obtain an optimal model. The goal here is not to discover new latent features but to search locally for relevant parameters using the hidden features. For further accuracy, the optimization method updates the weights, as depicted in Fig. 8. The weight distribution of RBM-0 is presented using Matplotlib, while the weight distribution of RBM-1 is presented using Tensorboard. These adjusted weights result from the fine-tuning stage using BP, which leverages the valuable information in the labels. The model’s accuracy is hence increased after seeking the optimal values of the RBMs’ weights, which produces the efficient FDBN generative model.

Fig. 8
figure 8

Weights distribution of the DBN before and after fine-tuning on Movielens 100K

5.1.4 Overall performance of the FDBN model in missing rating prediction (RQ3)

We compare the prediction errors of the FDBN and benchmark models using the six datasets. Table 7 presents RMSE and MAE error values in the movie recommendation field, while Table 8 reports RMSEs and MAEs of the methods in the music, video, and automotive recommendation domains.

Table 7 FDBN performance compared with benchmark approaches in the movie recommendation field
Table 8 FDBN performance compared with benchmark approaches in the music, video, and automotive recommendation fields

The FDBN model immensely outperforms all the state-of-the-art approaches on the six real datasets in the various recommendation fields. Specifically, on ml-100k, ml-1m, and FilmTrust, the RMSE values of the FDBN are, respectively, 0.6382, 0.6266, and 0.6989, which are 22.6%, 24.84%, and 10.61% lower in error compared to the most competitive solution, i.e., DNNREC. The FDBN also outperforms DNNREC in MAE by 17.74%, 19.14%, and 9.59% on ml-100k, ml-1m, and FilmTrust, respectively. DNNREC employs side information as input to model users’ and items’ feature vectors, operating different embedding layers. However, the learned latent factors are simply merged and fed to an MLP for missing rating prediction. Each feature vector is directly exploited for representation, which can overfit the training data. This fails to ensure an accurate recommendation, especially when the model needs large amounts of content data to learn significant hidden factors. Nevertheless, the efficient greedy layer-wise training of the FDBN not only yields relevant rating predictions without relying on auxiliary or implicit information but also enhances generalization to new data, thereby avoiding the overfitting problem that may occur in DNNREC.

On the Amazon datasets, SVD++ and JRLRR show good results but fail to surpass the FDBN in RMSE and MAE. On Amazon DG, the RMSE value of the FDBN is 0.6673, which is 22.1% lower than the second-best solution, i.e., JRLRR. The latter relies substantially on auxiliary data, which is not always effortlessly obtained, especially when privacy constraints apply. Nevertheless, the FDBN copes with the co-rated items limitation by incorporating all available interactions of users with items in the learning process without relying on side information. On the AIV dataset, the FDBN achieves an RMSE of 0.6789 and outperforms the MF-based SVD++ technique with a 26.3% improvement. This demonstrates that the generative FDBN is better than MF-based techniques (NMF, SVD++), which may lose valuable information from the original matrix; in contrast, the FDBN learns relevant latent factors and valuable hidden features based on probabilistic building blocks, where neural units merely communicate their activation states.

Unlike NCF, which employs a simple MLP for feature learning, the FDBN leverages a more powerful deep generative model to capture higher-level latent feature interactions in the sparse user-item matrix. The FDBN also significantly outperforms neighborhood-based techniques (e.g., MOEA) thanks to its effective learning of complex hidden features. MOEA uses the Non-dominated Sorting Genetic Algorithm II (NSGA-II) to determine each user's ideal number of similar neighbors. This genetic technique relies on a crowding-distance comparison that ignores the distribution of solutions, resulting in a biased search that tends to fall into local optima and converge poorly [39]. In contrast, the FDBN's greedy unsupervised pre-training provides a regularization effect by efficiently initializing the chief parameters. This enhances generalization and directs the model's weights toward a good local minimum, which explains why the FDBN vastly exceeds the baseline approaches and achieves the best RMSE and MAE values.

5.2 Evaluation for Top-k ranking recommendation

5.2.1 Experiment settings

Evaluation metrics

We use the Hit Rate (HR) and the Normalized Discounted Cumulative Gain (NDCG) metrics to assess the performance of the proposed KFDBN model in providing an effective ranked list of Top-k items. HR@k examines the presence of the test item in the recommended Top-k list, while NDCG@k evaluates the ranking quality by accounting for the position of the hits. HR and NDCG at a cutoff k are defined in (37) and (38).

$$ HR@k = \frac{N\_hits@k}{N\_users} $$
(37)

where N_hits@k is the number of users for whom the held-out test item appears in the Top-k recommendation list, and N_users denotes the number of users in the test set.

$$ NDCG@k = \frac{1}{IDCG@k} \times\sum\limits_{i=1}^{k}\frac{2^{rel(i)}-1}{\log_{2} (i+1)} $$
(38)

NDCG discounts hits at lower positions by a log2 factor, thereby rewarding hits at top ranks; rel(i) is the relevance (rating) of the item at position i.

$$ IDCG@k= \sum\limits_{i=1}^{\mid REL_{k} \mid} \frac{2^{rel(i)}-1}{\log_{2} (i+1)} $$
(39)

To provide a normalization factor, the IDCG (Ideal Discounted Cumulative Gain) computes the Discounted Cumulative Gain (DCG) of the ideal ranking ordered by relevance; ∣REL_k∣ denotes the size of the ideal list of relevant items, sorted by relevance, up to position k.
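As a concrete illustration, the sketch below computes HR@k and NDCG@k assuming a leave-one-out protocol with a single held-out test item per user (as the definition of N_hits@k suggests), so that IDCG@k = 1 and NDCG reduces to a log2 discount of the hit position. The dictionary-based inputs are illustrative placeholders, not the actual evaluation code.

```python
# Hedged sketch of HR@k and NDCG@k, assuming one held-out test item per user.
import numpy as np

def hit_rate_at_k(ranked_lists, test_items, k=10):
    """HR@k: fraction of users whose held-out item appears in their Top-k list."""
    hits = sum(test_items[u] in ranked_lists[u][:k] for u in test_items)
    return hits / len(test_items)

def ndcg_at_k(ranked_lists, test_items, k=10):
    """NDCG@k with binary relevance: one relevant item per user, so IDCG@k = 1
    and the score is 1 / log2(position + 1) for a hit, 0 otherwise."""
    scores = []
    for u, item in test_items.items():
        topk = ranked_lists[u][:k]
        if item in topk:
            pos = topk.index(item) + 1              # 1-based rank of the hit
            scores.append(1.0 / np.log2(pos + 1))
        else:
            scores.append(0.0)
    return float(np.mean(scores))

# Toy usage: user 0's held-out item is ranked 2nd, user 1's item is not in the list
ranked_lists = {0: [5, 7, 9, 1], 1: [3, 4, 8, 2]}
test_items = {0: 7, 1: 6}
print(hit_rate_at_k(ranked_lists, test_items, k=4))   # 0.5
print(ndcg_at_k(ranked_lists, test_items, k=4))       # ~0.315
```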

Since the KFDBN-based imputation technique proposed for the Top-k recommendation is a hybrid approach that relies on the results of the KFDBN classifier, we define in (40) the accuracy metric to evaluate the performance of the classification model.

$$ Accuracy = \frac{TP + TN}{TP+TN+FP+FN} $$
(40)

The accuracy measures the proportion of correct predictions made by the classifier out of all instances in the dataset. TP (True Positive) denotes a correct positive prediction, where the prediction is positive and the real value is indeed positive. TN (True Negative) denotes a correct negative prediction, where both the prediction and the real value are negative. FP (False Positive) indicates an incorrect positive prediction, where the prediction is positive but the actual value is negative. FN (False Negative) denotes a wrong negative prediction, where the prediction is negative but the true value is positive.
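For completeness, a minimal check of (40) on placeholder labels, using scikit-learn's confusion matrix (whose binary layout unravels as tn, fp, fn, tp):

```python
# Minimal sketch of (40): accuracy from the confusion counts on toy labels.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # placeholder relevancy labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # placeholder classifier outputs
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)                      # 0.75 for this toy example
```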

Baselines

We compare the KFDBN with the following competitive approaches on the Top-k ranking task:

  • Bayesian Personalised Ranking (BPR) [34]: MF-based approach that exploits pairwise learning for personalized ranking.

  • Deep Matrix Factorization (DMF) [47]: A DL-based approach that maps non-linear projections of users and items into a common low-dimensional feature space with a binary-based cross-entropy loss function.

  • Neural Matrix Factorization (NeuMF) [11]: A model that fuses generalized matrix factorization (GMF) and MLP. GMF allows the application of a linear kernel that learns latent low-rank features, while the MLP employs a non-linear kernel to extract latent feature interactions.

  • Deep Collaborative Filtering (DeepCF) [8]: A representation-learning approach that fuses CF models into the Collaborative Filtering Network (CFNet) to model a complex matching function while retaining the ability to learn low-rank user-item interactions effectively.

  • Neural Graph Collaborative Filtering (NGCF) [41]: It exploits the high-order connectivity in the bipartite graph of user-item transactions to inject a collaborative signal into an embedding propagation layer.

Parameter settings

For baseline approaches, the optimal values of hyper-parameters on the Top-k recommendation task are reported in Tables 9 and 10.

Table 9 Hyper-parameters of benchmark models in movie recommendation domain: Top-k recommendation task
Table 10 Hyper-parameters of benchmark models in music, video, and automotive recommendation domains: Top-k recommendation task

5.2.2 Impact of KFDBN features on the classification performance (RQ4)

We perform the item relevancy classification using three different types of SVC kernels (RBF, Sigmoid, and Polynomial) with a regularization parameter C = 1. To examine the impact of the KFDBN output features on the model performance, we conduct the classification on the ml-100k and ml-1m datasets using: (1) rating features, such as the average rating and rating count, combined with item features, such as the year and genre; (2) rating features combined with KFDBN features; (3) rating features combined with item features and KFDBN features. Classification results in the three cases using the different SVC kernels are reported in Fig. 9.

Fig. 9 Classification accuracy with and without KFDBN features using different SVC kernels

The obtained classification outcomes show that the RBF kernel outperformed the other kernel functions in SVC on all datasets, whereas the Sigmoid function provided the lowest accuracies. With a classification accuracy of 97.62%, the SVC with RBF kernel trained on ml-100k rating features combined with KFDBN features achieved the best performance. It is worth highlighting that combining KFDBN features with rating information led to better accuracy for all SVC kernels on ml-100k, followed by the cases where KFDBN features are merged with item and rating features for item relevancy classification. On the ml-1m dataset, the SVC with RBF kernel yielded the highest accuracy of 98.38% when applied to KFDBN features merged with rating and item features. Moreover, KFDBN features improved the classification results of the SVC with the other kernels, especially when they were incorporated solely with rating information. In contrast, when only item features were combined with rating factors, the classification showed inferior results on both datasets. When PCA was not applied to the UP-based item profiles, the classification accuracy decreased by 17% for ml-100k and 16% for ml-1m. This further proves the significance of the KFDBN model in extracting meaningful latent features that can efficiently train the SVC for a classification problem.
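A minimal sketch of this comparison is given below, assuming scikit-learn's SVC; the three feature blocks (`rating_feats`, `item_feats`, `kfdbn_feats`) and the relevancy labels are random placeholders standing in for the average-rating/rating-count features, the year/genre attributes, and the KFDBN latent factors.

```python
# Hedged sketch: SVC with RBF, Sigmoid, and Polynomial kernels (C = 1) trained on
# the three feature combinations; all inputs are random placeholders.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 1000
rating_feats = rng.normal(size=(n, 2))      # e.g., average rating, rating count
item_feats   = rng.normal(size=(n, 20))     # e.g., year, one-hot genres
kfdbn_feats  = rng.normal(size=(n, 32))     # KFDBN latent factors (placeholder)
labels       = rng.integers(0, 2, size=n)   # item relevancy labels (placeholder)

feature_sets = {
    "ratings + items":         np.hstack([rating_feats, item_feats]),
    "ratings + KFDBN":         np.hstack([rating_feats, kfdbn_feats]),
    "ratings + items + KFDBN": np.hstack([rating_feats, item_feats, kfdbn_feats]),
}
for name, X in feature_sets.items():
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)
    for kernel in ("rbf", "sigmoid", "poly"):
        acc = SVC(kernel=kernel, C=1).fit(X_tr, y_tr).score(X_te, y_te)
        print(f"{name:>25s} | {kernel:7s} | accuracy = {acc:.4f}")
```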

5.2.3 Influence of kernel dimensionality reduction in KFDBN classification (RQ5)

We show the impact of the KPCA employed in the KFDBN item relevancy classifier by comparing the classification results obtained after applying kernel PCA with the SVC accuracies reported in Fig. 9. Classification outcomes before and after the application of KPCA on the ml-100k and ml-1m datasets are presented in Figs. 10 and 11, respectively. We apply KPCA to the rating features, item characteristics, and KFDBN latent factors used by the SVC model, and examine four kernels (RBF, Sigmoid, Polynomial, and Cosine).

Fig. 10 Classification accuracy of SVC with and without the application of KPCA on trained features: ml-100k dataset

Fig. 11 Classification accuracy of SVC with and without the application of KPCA on trained features: ml-1m dataset

For ml-100k, the best accuracy of 98.61% is obtained when applying Sigmoid PCA to the rating features combined with the KFDBN latent factors used to train the SVC classifier with RBF kernel. Similarly, for ml-1m, applying Sigmoid PCA to the rating features merged with the KFDBN and item factors achieves the highest accuracy of 98.83%.

Note that, when KFDBN features are included among the training factors, Sigmoid PCA leads to higher accuracy for the SVC with RBF kernel, thus outperforming the other classification results, whereas KPCA with the other kernels hampers the performance of the SVC regardless of its kernel. In contrast, when only rating and item factors are included, reducing the feature dimensionality with KPCA yields relatively lower SVC accuracy for all kernels and all datasets. This demonstrates the efficiency of Sigmoid PCA in extracting relevant principal components from the latent features obtained by the KFDBN model, thereby enhancing the semi-supervised classification results. Conversely, applying RBF, Polynomial, and Cosine kernels in PCA adversely affects the SVC classification by distorting the features, since some informative dimensions are discarded.
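The best-performing configuration above (Sigmoid KPCA followed by an RBF SVC) can be expressed as a simple scikit-learn pipeline; the sketch below uses random placeholder features and labels rather than the actual ml-100k/ml-1m inputs, and the number of components is illustrative.

```python
# Hedged sketch: Sigmoid kernel PCA for dimensionality reduction, then an RBF SVC.
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 34))             # placeholder: rating + KFDBN features
y = rng.integers(0, 2, size=1000)           # placeholder relevancy labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = make_pipeline(
    KernelPCA(n_components=16, kernel="sigmoid"),   # kernel dimensionality reduction
    SVC(kernel="rbf", C=1),                         # item relevancy classifier
)
acc = clf.fit(X_tr, y_tr).score(X_te, y_te)
print(f"accuracy with Sigmoid KPCA + RBF SVC: {acc:.4f}")
```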

5.2.4 Overall performance of the KFDBN model in the Top-k ranking recommendation (RQ6)

The HR and NDCG values of the proposed KFDBN model and the baseline methods on the six datasets are summarized in Tables 11 and 12. The highest score in each column is shown in bold, the underlined value is the second-best, and △ represents KFDBN's relative improvement over the best competing solution. The experimental results demonstrate that the hybrid KFDBN model achieves the most promising ranking performance on all datasets, meaningfully outperforming the baseline methods, including DMF, by the largest margin of 17% in NDCG@10 on ml-100k. We believe this is because DMF learns a latent space that captures users' transactions on items through an inner product applied directly to the sparse rating matrix, which is inadequate for complex real-world data. In contrast, by using a generated denser rating matrix, the hybrid KFDBN model effectively determines item relevance for personalized ranking. Moreover, the cosine measure employed by DMF may not account for the varied rating scales used by a particular user [17], further hindering its efficiency in the Top-k recommendation.

Table 11 KFDBN ranking performance compared with benchmark approaches in the movie recommendation field
Table 12 KFDBN ranking performance compared with benchmark approaches in the music, video, and automotive recommendation fields

On the ml-1m dataset, the HR@10 value of KFDBN is 0.7440, surpassing the second-best approach, i.e., DeepCF (0.7253). The latter merges DMF and MLP for enhanced user-item matching score prediction and reaches results closest to NeuMF. Nevertheless, despite their efficiency, the NeuMF and DeepCF models show inferior results compared to the proposed KFDBN model. Two factors hinder their performance: the dual embedding spaces in NeuMF, which may lead to overfitting, and the reliance on a simple MLP for feature extraction in DeepCF. In contrast, the KFDBN leverages more powerful DL-based feature interaction learning rather than using an MLP as a mapping function to capture complex low-rank features. Furthermore, KFDBN employs a greedy unsupervised pre-training phase to effectively initialize the model's parameters, which enhances the regularization effect, addresses the gaps in training state-of-the-art DNN approaches, and makes the proposed model less prone to overfitting.

Otherwise, one major limitation with graph-based models (e.g., NGCF) is the potential of exploring the bipartite graph structure of irrelevant user-item transactions, which encodes a poor collaborative signal in the embedding function. Such a limitation explains why KFDBN largely outperforms NGCF in item ranking performance and demonstrates that imputing a denser rating matrix using KFDBN is more accurate in user interest detection than propagating embeddings on the user-item bipartite graph structure.

Besides, DL-based models do not always outperform non-deep methods. Regarding HR and NDCG, BPR performs well on the AIV and Auto datasets. Nevertheless, on AIV, the HR@10 and NDCG@10 values of KFDBN are 0.6571 and 0.4458, which are 6.67% and 5.09% higher than the second-best solution, i.e., BPR. On the Auto dataset, the KFDBN surpasses the competitive solution, BPR, by 8.47% and 4.67% in HR@10 and NDCG@10, respectively. This demonstrates that the KFDBN hybridization exceeds Bayesian ranking in capturing complex correlations between users and items.

Moreover, instead of relying on the standard uniform negative sampler used in BPR, which treats unavailable interactions as indicators of users' potential negative preferences and assumes unobserved items to be negative instances, KFDBN aims for a reliable imputation of unobserved ratings in the original sparse matrix and then applies MF on the denser generated matrix to learn significant latent features of users and items. This yields highly encouraging results and justifies KFDBN's meaningful outperformance over the baselines in the Top-k recommendation.
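The imputation-then-factorization step can be sketched as follows, under stated assumptions: `predict_rating` is a stand-in for the hybrid model's predictions, the rating matrix is a small random toy example, and plain truncated SVD replaces the tuned MF component. The sketch only illustrates the pipeline of densifying the matrix and then ranking unseen items from the reconstructed scores.

```python
# Hedged sketch of imputation followed by MF-based Top-k ranking on toy data.
import numpy as np

rng = np.random.default_rng(0)
R = rng.integers(1, 6, size=(50, 200)).astype(float)    # toy ratings in [1, 5]
R[rng.random(R.shape) < 0.9] = 0.0                      # ~90% sparse, 0 = missing

def predict_rating(u, i):
    """Stand-in for the hybrid generative/classifier predictions (placeholder)."""
    return 3.0

R_dense = R.copy()
missing = (R_dense == 0)
R_dense[missing] = [predict_rating(u, i) for u, i in zip(*np.where(missing))]

# Rank-f truncated SVD on the densified matrix yields user/item latent factors
f = 10
U, s, Vt = np.linalg.svd(R_dense, full_matrices=False)
scores = (U[:, :f] * s[:f]) @ Vt[:f, :]                 # reconstructed preference scores

k, user = 10, 0
unseen = np.where(R[user] == 0)[0]                      # rank only items the user has not rated
top_k = unseen[np.argsort(-scores[user, unseen])][:k]
print("Top-k items for user 0:", top_k)
```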

5.3 Implications of findings

The proposed model fully exploits the DBN's advantages to present a deep structure learning architecture that discovers multiple high-level feature representations for users and items. The unsupervised pre-training avoids the pitfalls of traditional random weight initialization, which may drive the model toward poor local optima or increase training complexity. The KFDBN achieves better generalization by providing a greedy layer-by-layer learning procedure of reliable weights. This stage establishes a starting point for further supervised fine-tuning by placing the model's parameters in an appropriate restricted range. Leveraging such a semi-supervised synergy yields a regularizing effect that renders the learning strategy more efficient, especially in sparse conditions, as evidenced by the extensive experiments, which showed the significant improvement of the FDBN in terms of RMSE and MAE.

Unlike traditional classification-based CF approaches, which merely employ shallow feature extraction [40, 44], the proposed KFDBN generates deep and meaningful latent representations for users and items, leading to higher SVC-based classification outcomes. The unsupervised procedure of KFDBN indirectly affects the classifier's performance by providing a reliable initialization of weights; the supervised fine-tuning phase then helps optimize the model further by searching locally for optimal parameters. The SVC can therefore benefit from these stages by adopting the output KFDBN features to conduct an efficient semi-supervised classification. Moreover, the experimental results showed that KFDBN features can enhance the accuracy of supervised classifiers, especially when Sigmoid KPCA is further leveraged. The sigmoid extraction of the features learned by the KFDBN model is thus a promising approach that consistently yields accurate and practical recommendations based on semi-supervised classification.

Some state-of-the-art imputation-based techniques replace missing ratings with plausible values; the column mean or the row mean of the rating matrix may, for instance, be substituted for an empty entry [40]. Such an imputed value is not derived from users' interests, which alters the data's covariance structure and thereby impedes system performance. On the other hand, several ranking-based benchmark methods [8, 11, 34] imply a significant modification of the experimental evaluation by treating unobserved ratings as negative instances to compensate for the skewed distribution of known interactions. In contrast, the proposed model exploits a more powerful architecture for imputation to infer the unavailable ratings in the original sparse matrix, thus overcoming the hindrance of missing values on MF-based ranking performance. We showed that by employing SVD on a denser matrix imputed using KFDBN, the model captures more significant latent features, yielding promising prediction results for the Top-k recommendation. The proposed unified model exploits local and global correlations in the data for imputation, and an MF technique can then be leveraged directly to make relevant predictions on the reduced rating matrix, making this hybrid approach appropriate for high-dimensional and heterogeneous datasets while exhibiting more meaningful accuracy than traditional imputation-based techniques.

One of the significant issues imposed by sparsity is the inability to generate reliable predictions for long-tail items that have only a few ratings [27]. This challenge occurs when the recommendation approach depends on a single latent representation that exploits the ratings concentrated on a few popular items. In contrast, the proposed imputation approach can provide users with personalized and accurate recommendations from the items available in the long tail by hybridizing deep and kernel-based methods to learn latent features of users and items stemming from various aspects. These latent factors comprehensively reflect users' interests and items' characteristics for effective missing rating prediction, resulting in better novelty and diversity [17].

6 Conclusion

The proposed KFDBN first learns a deep generative model by leveraging a greedy layer-by-layer learning procedure for efficient feature extraction and missing rating prediction. It then creates input vectors for kernel-based discriminative support vector classification, performed in two stages to explore the impact of the KFDBN output features on kernel-based classification. This work intensively investigates the influence of kernel-based feature extraction on the system's performance. Experiments show that a further Sigmoid extraction applied to the KFDBN output features using Kernel Principal Component Analysis yields accurate and practical recommendations based on semi-supervised classification. Moreover, the positive predictions derived from the KFDBN's generative model are combined with the ratings of relevant items resulting from the KFDBN kernel-based classifier to impute a denser user-item interaction matrix for effective Matrix Factorization-based Top-k recommendation. Empirical evaluations on six datasets with varying sparsity levels demonstrate that the KFDBN achieves higher accuracy and outperforms state-of-the-art models. Future perspectives include extending the KFDBN to meta-learning by inferring powerful generalizations from meta rating interactions to model preference learning for cold-start users and items.