1 Introduction

The expanding demand for personalization has paved the way for Recommender Systems (RSs), which counter the prevailing information overload by predicting users’ intentions. The efficiency of such a system is generally determined by how effectively diverse users’ interests are identified [1, 17], how accurately users’ transactions on items are represented [17, 32], and how well significant user and item features are extracted [17]. Collaborative filtering (CF) is a fast-evolving technique [18, 36] that relies on feedback scores to infer similar users and items. The Netflix Prize competition, which began in 2006, spurred several advances in this area and gave the research community access to large movie rating datasets, drawing the attention of many researchers and scientists [4]. However, precisely comprehending varied users’ preferences demands complex models that require massive amounts of data, which runs counter to the sparsity limitation. The latter refers to the situation where the available data on users’ interactions with items is sparse [17], making it challenging for the RS to infer accurate predictions.

The continuous pursuit of better prediction accuracy has led to the use of Deep Learning (DL), a subfield of Machine Learning (ML), to build robust and powerful RSs. The current interest in deep models in the recommendation field is due to novel methods that have shown practical training procedures, thus outperforming shallow techniques, especially under sparse conditions [17, 23, 24, 44]. On the other hand, kernel methods have also gained keen interest in RSs, owing to their meaningful feature extraction and efficient discovery of structure in data. Recent kernel-based approaches leverage the kernel trick to explore relevant structures in non-linear empirical data in different contexts, thereby capturing reliable features for better prediction [7, 30] or classification [31, 44], provided the kernel function is selected appropriately.

Several state-of-the-art papers integrate side information into DL-based models as an inevitable choice to propose hybrid solutions for missing rating prediction [22, 24]. The additional information can be item properties [24], textual reviews [43], or simply demographics gathered from users’ profiles [37]. Nevertheless, accurate personalization is not about names, ages, or places, as such attributes do not track users’ volatile tastes and behaviors [17]; moreover, such data is hard to collect due to privacy constraints. On the other hand, to generate Top-k recommendations, consisting of ranked and personalized lists of items for users, some papers employ auxiliary information in Matrix Factorization (MF) techniques as regularizers to learn latent features [33, 48]. However, the resulting feature vectors represent user and item interactions linearly, which is notably inadequate for heterogeneous real-world data, especially when the user-item matrix is sparse and additional information is unavailable [17].

To address these problems, we introduce, in this paper, the Kernelized Finetuned Deep Belief Network (KFDBN) model to accurately predict missing ratings and generate efficient Top-k recommendations even when only a few ratings of users on items are available. We first learn a deep generative model using a Deep Belief Network (DBN) for missing rating prediction. The model captures higher-level hidden feature representations of users and items by allowing reliable layer-by-layer learning of weights, resulting in better generalization. The deep structure is built by stacking multiple Restricted Boltzmann Machines (RBMs) to learn the visible layer’s distribution probabilistically, allowing each successive RBM to capture meaningful features in the sparse rating matrix. We use the Contrastive Divergence (CD) algorithm [13] to initialize the deep architecture of the generative model in an unsupervised way. Such initialization avoids the typical problems of gradient-based approaches, which are time-consuming, have to deal with many layers and parameters throughout the learning process, and might get trapped in local minima [23]. Next, the compressed data representation and the restricted parameter space establish a starting point for supervised fine-tuning. This optimization ensures a further effective generative reconstruction of user-item interactions, yielding accurate predictions and personalized recommendations.

We also present the efficient KFDBN classifier that emphasizes extracting powerful hidden features of users and items, unlike classification-based CF approaches, which merely employ shallow feature extraction. We demonstrate that KFDBN can be successfully applied to enhance classification performance for different datasets by using the deep generative approach as a base pre-model that produces, along with Kernel Principal Component Analysis (KPCA) [35], meaningful latent feature vectors to conduct an efficient two-step kernel-based discriminative semi-supervised classification.

Another crucial contribution of this paper is a novel imputation technique that addresses sparsity issues in MF-based recommendation. We use KFDBN as a hybrid imputation technique to infer missing entries in the original sparse rating matrix. The positive predictions emanating from the generative model are merged with the ratings of relevant items resulting from the KFDBN classifier component to create a denser user-item rating matrix for the Top-k recommendation. This solves the optimization issues that occur when MF-based methods are applied directly to sparse data, since the error function considers only the available inputs. On the other hand, the proposed unified model discovers the underlying local and global non-linear correlations in sparse data to infer missing user-item entries, which significantly differs from state-of-the-art MF-based recommendation approaches [11, 34] that treat unobserved ratings as negative inputs to compensate for the skewed distribution of known interactions.

The contributions of this paper are summarized as follows:

  • We propose a novel hybrid model named Kernelized Finetuned Deep Belief Network (KFDBN) that combines the practical training of the DBN with kernel methods’ relevant exploration of data structures to overcome sparsity issues in RSs. We first present the effective Fine-tuned DBN (FDBN) model for accurate missing rating prediction.

  • We show that going deeper helps to capture reliable features by comparing the performance of the RBM with that of the DBN.

  • We demonstrate that supervised fine-tuning of the generative model yields better prediction results.

  • We present a practical hybrid approach to seamlessly incorporate features extracted by the generative model into kernel-based feature extraction using KPCA to create user profiles. The unified KFDBN feature extraction helps to capture important information among items by discovering correlation features among co-rated items from users’ perspectives, thus identifying precise and relevant items that satisfy users’ preferences. The proposed user profile expansion mechanism significantly improves the performance of sparse recommendation.

  • We prove that output features extracted using the hybrid KFDBN lead to higher Support Vector Classification (SVC) outcomes, demonstrating that KFDBN can be successfully extended to effective semi-supervised classification.

  • We develop a novel imputation technique that infers missing ratings in the original sparse user-item interaction matrix to solve sparsity issues in MF-based recommendations.

  • Finally, extensive experimentation on six sparse datasets in diverse domains shows that the FDBN generative model significantly enhances the missing rating prediction and outperforms the state-of-the-art methods by 21.32% and 16.35% in terms of RMSE and MAE, respectively, averaging on all used datasets. We also empirically prove that the KFDBN leads to higher performance for the Top-k recommendation and yields 5.2% and 8.1% improvement in HR@10 and NDCG@10, respectively, averaging on the six datasets.

The remainder of this paper is divided into five sections. Section 2 provides an overview of related work on DL-based feature learning and kernel-based methods applied in RSs. The foundations and preliminaries of RBMs, DBNs, and kernel methods are presented in Section 3. Section 4 is dedicated to thoroughly describing the proposed KFDBN and discussing its different components. Section 5 reports experimental results, answers the research questions addressed in this paper, and discusses the findings. Finally, the paper’s conclusions and future directions of the work are presented in Section 6.

2 Related work

The related work of this paper covers two chief points: DL-based models for enhanced feature extraction and kernel-based techniques applied in RSs.

2.1 Enhanced-feature extraction based on deep networks

Autoencoders (AEs) are widely employed in RSs for their appropriate latent feature representation. Bathla et al. [3] use a shared layer in an AE to incorporate indirect social trust with ratings and learn accurately from linked representations. Hybrid AE-based models include users’ side information to address data scarcity [46]. However, users’ latent information is typically diverse and multimodal and may contain aspects that are irrelevant for prediction. As a result, emphasizing such features in the AE causes overfitting. Moreover, the reconstruction outputs of AE-based approaches are sometimes directly leveraged as predictions of unknown rating values [23], which may impair the effectiveness of the recommendation.

CapsMF [22] leverages Bi-directional Recurrent Neural Networks (RNNs) with Capsule Networks for textual analysis of reviews to generate recommendations, thus enhancing the capabilities of Convolutional Neural Networks (CNNs) and RNN-based models in text modeling and representation. On the other hand, Capsule Networks require careful tuning since they are time-consuming to train, especially for large-scale recommendation.

Generative Adversarial Networks (GANs) are employed to learn user and item representations [9, 29] and to perform rating augmentation [6] to alleviate sparsity in CF. However, GAN-based models tend to predict high-value ratings for most missing entries in the user-item matrix [6]. Hence, approaches that adopt such a biased augmented matrix for generating recommendations are likely to perform poorly, as they fail to capture users’ actual interests. Furthermore, the discriminator of such models sometimes produces a high percentage of contradicting labels for the same items due to early and misleading convergence [29]. Additionally, inverting a fully trained GAN-based system is a tedious task.

2.2 Kernel-based methods for recommendation

Though linear methods are simpler to interpret and have shown promising prediction results [38], they may not be suitable for training on non-linearly separable and heterogeneous real-world data [17], which has promoted the application of kernel methods in RSs. Chen et al. [7] propose kernel-based matrix completion for neighborhood-based recommendation. The approximation of missing ratings is enhanced by integrating the Gaussian kernel function with the similarity measure. The technique combines various characteristics in non-linear latent feature spaces by employing a multi-kernel framework, adjusting the kernels’ weights to generate relevant recommendations.

The Kernel Context Recommender System (KCR) employs the kernel trick to build a model based on a context rating matrix for prediction using contextual information [21]. The research shows that varying the kernel function leads to higher performance than standard context-aware RSs. However, Principal Component Analysis (PCA) could be further applied to filter relevant context features and enhance the efficiency of the kernel mapping framework. Besides, a DL-based model could learn diverse representations from varied contexts, thus improving the quality of the recommendations.

To transcend the idea of incorporating mapped features of similar users directly into classifiers to generate recommendations, [31] used Multiple Kernel Learning for feature separation and then fed the output to the Adaptive Neuro-Fuzzy Inference System (ANFIS) for classification. The proposed model has shown relevant sensitivity and specificity. However, it has not been evaluated under sparse conditions, and it applies solely to heart disease recommendation. Another paper [44] combined the Scale-Invariant Feature Transform (SIFT) with the Faster R-CNN algorithm and Support Vector Machine (SVM) classification for music recommendation based on a DL and IoT architecture. First, SIFT is used to extract local features based on the Gaussian convolution kernel. Then, Faster R-CNN is utilized to extract the underlying multi-scale features of scene images. Finally, middle-layer characteristics with spatial information are acquired using SVM classification to develop the background music system. Otherwise, Singular Value Decomposition (SVD) may be leveraged to infer low-dimensional vectors of users’ data [30]; a kernel function can then discover relevant latent features between users before predicting ratings using neural models such as the Multilayer Perceptron (MLP) [30].

3 Background preliminaries

3.1 Restricted Boltzmann Machine

A Restricted Boltzmann Machine is a stochastic generative Neural Network (NN) that learns a probability distribution over its set of inputs [13]. The weighted bipartite graph of the RBM, \(\theta =(W_{ij},v_{bias_{i}},h_{bias_{j}})\) is composed of two layers (the visible layer V and the hidden layer H) connected through a set of weights Wij, and biases (\(v_{bias_{i}}\), \(h_{bias_{j}}\)). RBMs can be used for regression, classification, dimensionality reduction, CF, and so forth [19]. Depending on the task, they can be trained in a supervised or unsupervised way. A Restricted Boltzmann Machine is illustrated by Fig. 1.

Fig. 1
figure 1

Restricted Boltzmann Machine

The binary units of the RBM are connected to form a joint configuration (v,h) whose overall energy is defined by (1)

$$ E(v,h;\theta)= -{\sum\limits_{i}^{n}}{\sum\limits_{j}^{m}}W_{ij}v_{i}h_{j}-\sum\limits_{i}v_{bias_{i}}v_{i}-\sum\limits_{j}h_{bias_{j}}h_{j} $$
(1)

In a Boltzmann machine (BM), the only restrictions are that all connections are symmetrical, and no unit has a connection to itself. However, due to their high complexity, such networks are much less utilized compared to RBMs, which are BMs in which there are no visible-visible, and hidden-hidden connections [13]. The joint probability distribution is then calculated using (2).

$$ p(v,h;\theta)=\frac{e^{-E(v,h;\theta)}}{\eta(\theta)} $$
(2)

Where \(\eta (\theta )={\sum }_{v}{\sum }_{h}e^{-E(v,h;\theta )}\) defines the appropriate normalization factor. The probability that a model assigns states of hidden units to the vector h is computed by marginalizing out the visible vector v as presented in (3).

$$ p(h;\theta)=\sum\limits_{v}p(v,h;\theta)=\frac{1}{\eta(\theta)}\sum\limits_{v} e^{-E(v,h;\theta)} $$
(3)

To get a sample generated by the network according to the probability p(v), the Gibbs sampling, which is a Markov chain Monte Carlo (MCMC) technique, is used [13]. Yet, to obtain samples that are faithful to the learned distribution, it is theoretically necessary to iterate between the states of the Markov chain until convergence.

Figure 2 depicts the sampling process of hidden units \(h^{(t)}\leftarrow p(h=1\mid v^{(t)})\) followed by visible units sampling \(v^{(t+1)} \leftarrow p(v = 1 \mid h^{(t)})\).

Fig. 2
figure 2

Gibbs sampling iterative process
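
To make this iterative process concrete, the following minimal NumPy sketch (an illustration only; the dimensions, helper names, and random initialization are not taken from the paper) performs one Gibbs iteration: hidden units are sampled from p(h = 1 ∣ v) and visible units are then resampled from p(v = 1 ∣ h).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def gibbs_step(v, W, v_bias, h_bias):
    """One Gibbs iteration: sample h ~ p(h | v), then v ~ p(v | h)."""
    p_h = sigmoid(v @ W + h_bias)                        # p(h_j = 1 | v)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(h @ W.T + v_bias)                      # p(v_i = 1 | h)
    v_new = (rng.random(p_v.shape) < p_v).astype(float)
    return h, v_new

# Toy RBM with 6 visible and 4 hidden binary units
n, m = 6, 4
W = 0.01 * rng.standard_normal((n, m))
v_bias, h_bias = np.zeros(n), np.zeros(m)
v = rng.integers(0, 2, size=n).astype(float)
h, v_next = gibbs_step(v, W, v_bias, h_bias)
```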

Training the RBM is thus performed by maximizing the log probability, which consists simply in minimizing the Kullback-Leibler divergence [13] between two distributions: the target distribution \(\langle v_{i}h_{j} \rangle^{0}\) of the training data, namely the input, and the distribution \( \langle v_{i}h_{j} \rangle ^{\infty }\) modeled by the RBM. The gradient of the log probability with respect to a specific weight is, therefore, computed according to (4).

$$ \frac{\partial log(p(v))}{\partial W_{ij}} = \frac{\partial log(\frac{1}{\eta(\theta)}{\sum}_{h} e^{-E(v,h;\theta)})}{\partial W_{ij}} $$
$$ \frac{\partial log(p(v))}{\partial W_{ij}} = \langle v_{i}h_{j} \rangle^{0}-\langle v_{i}h_{j} \rangle^{\infty} $$
(4)

3.2 Deep belief network

The Deep Belief Network is a probabilistic model mainly composed of stacked RBMs. The training of the DBN is performed through a greedy layer-by-layer learning algorithm [14]. A first unsupervised pre-training phase is carried out successively in each layer before finally applying supervised optimization to the initialized parameters resulting from the first stage. When trained on a set of unlabeled examples, the stacked RBMs act as feature detectors that probabilistically reconstruct the input data. The chief purpose of this unsupervised step is to match the inputs and outputs of successive hidden layers. This reconstruction criterion thus ensures that most of the information is preserved after training all the layers. The model parameters at layer Z, along with the conditional probabilities of the hidden units, are used as the generated data needed to train the model’s parameters at layer Z + 1. The model can be further optimized with supervision to perform fine-tuning.

3.3 Kernel-based methods

The use of kernel techniques is driven by the possibility of projecting data values to another space of a higher dimension where the linear separation is achievable [7]. The performance of such approaches crucially depends on the appropriate choice of the kernel function. The latter is a symmetric, continuous, and positive function corresponding to a scalar product in a higher dimension space. Formally, for each kernel K, there exists a function \( \phi : \mathcal {X} \subset \mathbb {R}^{n} \rightarrow {\mathscr{H}}\) such that:

$$ K(x,x^{\prime}) = \langle\phi(x),\phi(x^{\prime})\rangle_{\mathcal{H}} \quad \forall x,x^{\prime} \in \mathcal{X} $$
(5)

with \(\mathcal {X}\) the original space and \({\mathscr{H}}\) the Hilbert space of dimension greater than \(\mathcal {X}\). The kernel functions examined in this work are presented in Table 1. Along with the cost C, each kernel has a specific parameter to optimize (i.e., γ, β, or P), allowing efficient performance and making it possible to benefit from simple, rigorously grounded techniques while dealing with non-linear problems.

Table 1 Kernel functions

In contrast to the linear kernel, which produces linear boundaries in a higher dimension, the polynomial kernel expands m input features to \(m^{P}\) features with polynomial degree P > 0 by performing feature conjunctions, which are valuable when the input data is normalized [16]. This kernel also optimizes β, a real weighting parameter that trades off the impact of lower-order polynomials. The Radial Basis Function (RBF) is a Gaussian kernel independent of the nature of the observed data, which makes it more suitable when there is no prior information about the training data [45]. Here, γ controls the variance of the distribution; hence finding the optimal value of γ yields better kernel performance. Unlike other statistical similarity metrics, the cosine kernel measures similarity as the normalized dot product of two vectors in a high-dimensional space by abstracting out their magnitude, which is beneficial when the size of the vectors influences the similarity computation, primarily when the Euclidean distance is employed [42]. With a sigmoid kernel, the learning algorithm behaves similarly to a two-layer NN, being less prone to local optima and generalizing well. A typical value for the slope γ is \(\frac {1}{M}\), where M is the dimension of the input data [28].
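
As a rough illustration of these kernels (Table 1 is not reproduced here, so the exact parameterizations below follow common conventions and the roles of γ, β, and P described above; they are an assumption, not the paper’s definitions):

```python
import numpy as np

def linear_kernel(x, y):
    return x @ y

def polynomial_kernel(x, y, gamma=1.0, beta=1.0, P=3):
    # degree-P polynomial with weighting parameter beta
    return (gamma * (x @ y) + beta) ** P

def rbf_kernel(x, y, gamma=0.5):
    # Gaussian kernel; gamma controls the width of the distribution
    return np.exp(-gamma * np.sum((x - y) ** 2))

def cosine_kernel(x, y):
    # normalized dot product, independent of the vectors' magnitudes
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def sigmoid_kernel(x, y, gamma=None, beta=0.0):
    gamma = gamma if gamma is not None else 1.0 / x.size   # slope gamma = 1/M
    return np.tanh(gamma * (x @ y) + beta)

x, y = np.array([1.0, 0.5, -0.2]), np.array([0.3, -0.1, 0.8])
print(rbf_kernel(x, y), cosine_kernel(x, y))
```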

Thanks to their theoretical foundations, SVMs provide a unique approach that is less prone to overfitting and advantageous compared to an Artificial Neural Network (ANN), which may not allow robust classification of different samples [10]. Kernel methods have been particularly effective for classification based on SVMs, which search for a maximum-margin hyperplane that separates the closest data points belonging to different classes while lying at the maximum distance from the observations [5]. The mapping of points to a higher-dimensional space and the appropriate choice of the kernel function’s optimization parameters not only decrease the computational cost, mainly when dealing with extensive data, but also enhance the generalization performance.

Kernel methods are also leveraged in KPCA [35], which is a non-linear extension of PCA. The latter determines new independent variables, called Principal Components (PCs), thus finding relevant projection spaces for the data by maximizing their projected variance. In other words, the information extracted using KPCA is non-linearly related to the input data. In fact, by employing kernel functions, KPCA projects the observations into a new space of higher dimension and then carries out an ordinary PCA. Compared to other non-linear extensions of PCA, for instance NNs, KPCA is known for its stability and reduced computational cost [45].
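
A minimal scikit-learn sketch contrasting KPCA with linear PCA is given below (the data, component count, and RBF width are illustrative stand-ins, not values used in our experiments):

```python
import numpy as np
from sklearn.decomposition import KernelPCA, PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))     # stand-in for extracted feature vectors

# Non-linear projection through an RBF kernel, then ordinary PCA in that space
kpca = KernelPCA(n_components=5, kernel="rbf", gamma=0.1)
Z_kpca = kpca.fit_transform(X)

# Linear PCA for comparison
pca = PCA(n_components=5)
Z_pca = pca.fit_transform(X)

print(Z_kpca.shape, Z_pca.shape)       # (200, 5) (200, 5)
```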

4 Proposed model

This section presents the proposed Kernelized Finetuned Deep Belief Network (KFDBN) model, which comprises three major components:

  • KFDBN feature extraction:

    • Fine-tuned DBN (FDBN) for missing rating prediction: We first develop a deep generative model of stacked RBMs by leveraging unsupervised pre-training as an initializer of the deep architecture instead of randomly initializing the model’s parameters. This unsupervised pre-training phase has a regularization effect that renders the learning process more efficient by establishing a restricted parameter space for further optimization. We then incorporate semi-supervised learning to improve the model’s generalization performance by reaching a lower minimum of the cost function. Once all RBMs have been trained, the initialization phase is complete. The compressed representation of the data is used as input to a logistic layer that is trained with backpropagation (BP) to perform discriminative fine-tuning. Hence, the FDBN allows the extraction of higher-level hidden features of users and items and leads to substantially better prediction results in the reconstructed rating matrix.

    • Kernel-based feature extraction: This stage applies KPCA to further capture the relevant hidden structure of the categorical FDBN output features. The processed latent factors are used to expand Users’ and Items’ Profiles to discover correlation features among co-rated items from users’ perspectives. As a result, each item is characterized by a number of hidden characteristics of the users who assessed the item. Then, PCA converts the learnt factors into a new set of related unified KFDBN features, retaining only the relevant patterns and meaningful information.

  • KFDBN item relevancy classification: In this phase, we investigate a hybrid approach that incorporates the extracted KFDBN features and the available feature data as input vectors of a kernel-based discriminative classifier, aiming to efficiently group items into three classes reflecting item utility: relevant, neutral, and irrelevant.

  • KFDBN-SVD Top-k recommendation: The driving idea behind this phase is to enhance the ranking performance of MF-based recommendation approaches. We use the KFDBN results to infer missing ratings in the original sparse user-item interaction matrix. The positive predictions emanating from the FDBN generative model are merged with the ratings of relevant items resulting from the KFDBN classifier component to create a denser imputed matrix for SVD-based Top-k recommendation.

The proposed KFDBN architecture is depicted in Fig. 3. The following sub-sections examine in depth each component and its influence on the entire proposed model.

Fig. 3
figure 3

The architecture of the proposed KFDBN model

4.1 KFDBN feature extraction

4.1.1 FDBN for missing rating prediction

Unsupervised pre-training

In this phase, we aim for an accurate generative reconstruction of the sparse input data, which is achievable as long as the model’s parameters are initialized effectively. Therefore, to overcome the typical problems of gradient descent-based approaches, which can quickly get stuck in local minima [23] and do not allow the efficient training of hidden layers of large dimensions [13], we join an additional layer of feature detectors to an RBM to learn an effective deep generative model for missing rating prediction. Here, layer-wise unsupervised training is performed on stacked RBMs to capture higher-level features of the input data. To expedite the sampling stage depicted in Fig. 2, Contrastive Divergence is applied [13]. The primary purpose is to maximize the log-likelihood, i.e., log(p(v)), of the data to learn the weight matrix W = {Wij}, which guarantees an efficient reconstruction of the sparse input user-item matrix. The weights of the network can be updated according to (6).

$$ {\triangle W_{ij} = \alpha (\langle v_{i}h_{j} \rangle^{0} - \langle v_{i}h_{j} \rangle^{\infty})} $$
(6)

Where \(\langle v_{i}h_{j} \rangle^{0}\) is the expectation under the training distribution, \(\langle v_{i}h_{j} \rangle ^{\infty }\) denotes the expectation under the reconstructed sparse data distribution provided by the RBM, and α is the learning rate. Since the units of the same layer have no connection, the states in a layer only depend on the states of the units in the other layer; hence \(\langle \cdot \rangle^{0}\) can be easily computed. On the other hand, \(\langle \cdot \rangle ^{\infty }\) can be obtained by applying the CD algorithm, which incorporates two significant points:

  • Instead of starting at a random visible point, i.e., v(0), CD uses an input example from training data to initialize the Gibbs procedure, i.e., \(v^{(0)}\leftarrow v \in \widehat {T}\), which makes it possible to explore the regions near the training examples to modify the surface of the energy function [13].

  • A regularization effect is ensured by performing only GS iterations of Gibbs sampling [2].

The algorithm of the Contrastive Divergence with GS steps is described in Algorithm 1. Equation (7) is used to calculate the hidden binary units, and then the visible states are computed using (8). Output visible units are approximate reconstructions of training data.

$$ p(h_{j}=1\mid v) = \sigma\left( {\sum\limits_{i}^{n}}w_{ij}v_{i} + h_{bias_{j}}\right) $$
(7)

Where \(\sigma (s) = \frac {1}{1 + e^{-s}}\) is the sigmoid logistic function.

$$ p(v_{i}=1\mid h) = \sigma\left( {\sum\limits_{j}^{m}}w_{ij}h_{j} + v_{bias_{i}}\right) $$
(8)
Algorithm 1
figure a

Contrastive Divergence(\(\widehat {T}\), α, W, vbias, hbias, mo, GS).

By comparing the computed probability with the predefined threshold κ according to (9), hj is set to 0 if p(hj = 1∣v) is lower than κ and 1 otherwise. The same principle is applied when it comes to activating or inhibiting the states of visible units.

$$ x_{k} = \begin{cases}0 & p(x_{k}=1\mid y) < \kappa \\1 \ & p(x_{k}=1\mid y)\geq\kappa \end{cases} $$
(9)

Where κ ranges from 0.5 to 1.

When training the RBM with the Contrastive Divergence algorithm, the model’s parameters are updated in each training epoch, and the GS steps of Gibbs sampling are conducted. We divide the training data \(\widehat {T}\) into mini-batch samples with a batch size of BS data points. Updating the summations only needs to be done once in each batch; hence, a batch of input vectors still requires O(nm) complexity, where n and m are the input vector and hidden vector dimensions, respectively. Moreover, the CD algorithm avoids the exponential complexity of summing over all of the values of the visible or hidden units by sampling the model distribution with the probability distribution of the input vector. The calculation of the conditional probabilities has a computational complexity of O(nm). Therefore, the overall complexity of the Contrastive Divergence with GS Gibbs sampling steps is O(BS × GS × n × m).

The first RBM of the unsupervised DBN model is trained using Contrastive Divergence with the input sample v(0) of training data presented on its visible layer, with categorical five-star ratings regarded as five nodes, one for each potential evaluation value. Once trained, for each sample v of the training set, the first RBM associates a configuration h of neurons from the hidden layer. h is constructed by sampling the distributions \(p(h_{j}\mid v^{(t)})\). The configurations h thus obtained are compressed versions of v. Each of the following RBMs learns the hidden representations of the previous one. Here, the hidden layer of the first RBM acts as a visible layer for the next RBM. Once all RBMs have been trained in an unsupervised fashion, the compressed version of the input data in the hidden layer of the last RBM, along with the model’s inner parameter space, defines an initialization point for a supervised fine-tuning algorithm, thus enabling semi-supervised strategies to achieve optimal generalization results. The unsupervised pre-training stage of the DBN is detailed in Algorithm 2.

Algorithm 2
figure b

Unsupervised pre-training(\(\widehat {T}\), Nls, αpr, W, hbias, mo, GS).
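
For illustration, the sketch below implements one CD update (Eqs. (6)–(8)) and the greedy layer-wise stacking of Algorithms 1 and 2 in NumPy. It is a minimal toy version only: layer sizes, learning rate, and epochs are illustrative, the momentum mo and the threshold κ of Eq. (9) are omitted, and the categorical five-node encoding of ratings is replaced by generic binary inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def cd_update(v0, W, v_bias, h_bias, alpha, gs):
    """One Contrastive Divergence update on a mini-batch v0 of shape (BS, n)."""
    p_h0 = sigmoid(v0 @ W + h_bias)                    # Eq. (7): p(h = 1 | v)
    pos = v0.T @ p_h0                                  # <v h>^0, positive phase

    h = (rng.random(p_h0.shape) < p_h0).astype(float)
    for _ in range(gs):                                # GS steps of Gibbs sampling
        p_v = sigmoid(h @ W.T + v_bias)                # Eq. (8): p(v = 1 | h)
        v = (rng.random(p_v.shape) < p_v).astype(float)
        p_h = sigmoid(v @ W + h_bias)
        h = (rng.random(p_h.shape) < p_h).astype(float)
    neg = v.T @ p_h                                    # <v h>^GS, negative phase

    bs = v0.shape[0]
    W += alpha * (pos - neg) / bs                      # Eq. (6)
    v_bias += alpha * (v0 - v).mean(axis=0)
    h_bias += alpha * (p_h0 - p_h).mean(axis=0)
    return W, v_bias, h_bias

# Greedy layer-wise pre-training: each RBM is trained on the hidden
# activations of the previous one (toy layer sizes and random binary data).
layer_sizes = [500, 200, 100]
data = rng.integers(0, 2, size=(256, layer_sizes[0])).astype(float)

weights, hidden_biases = [], []
inputs = data
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    W = 0.01 * rng.standard_normal((n_in, n_out))
    vb, hb = np.zeros(n_in), np.zeros(n_out)
    for epoch in range(5):
        for start in range(0, len(inputs), 32):        # batch size BS = 32
            batch = inputs[start:start + 32]
            W, vb, hb = cd_update(batch, W, vb, hb, alpha=0.0005, gs=1)
    weights.append(W)
    hidden_biases.append(hb)
    inputs = sigmoid(inputs @ W + hb)   # compressed representation feeds the next RBM
```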

Supervised fine-tuning

In the supervised fine-tuning phase, we back-propagate the error to calculate gradients of the loss function to obtain optimal weights and biases that will further enhance the performance of the generative rating prediction. The proposed model is considered fine-tuned if it can find the desired target T associated with the corresponding input data V. This is achieved by minimizing the quadratic cost function MSE defined by (10).

$$ \ell =\frac{1}{2N}\sum\limits_{V}\parallel T-H^{(V,N_{ls})}\parallel^{2} $$
(10)

Where Nls is the number of layers, N is the number of training examples, and H is the vector of output activation states from the NN when using V as input.

The aim of the fine-tuning stage, described in Algorithm 3, is to investigate how changes in the weights and biases affect the reconstruction loss function by computing the gradients (\(\frac {\partial \ell }{\partial W}\), \(\frac {\partial \ell }{\partial h_{bias}}\)). Thence, an intermediate output error \(out^{N_{ls}}\) in the last layer is computed by taking into account the rate of change of ℓ with respect to the output activations, \(\nabla _{H}\ell = \frac {\partial \ell }{\partial H^{N_{ls}}}\), and \(\sigma ^{\prime }\), the derivative of the activation function. Given that \(D^{(Z)} \leftarrow W^{Z} H^{Z-1} + h_{bias}^{Z}\), \(\sigma ^{\prime }\) indicates how fast the activation function is changing at \(D^{N_{ls}}\). \(out^{N_{ls}}\) is measured using (11). In the case of the MSE function, \(\nabla _{H}\ell = H^{N_{ls}} - T\).

$$ out^{N_{ls}} = \frac{\partial \ell}{\partial H^{N_{ls}}} \sigma^{\prime}(D^{N_{ls}}) $$
(11)

The output error for any layer can be measured by merging (11) with (12). The latter is based on the error of the next layer, which involves a backward propagation of the error through the NN.

$$ out^{Z} = (W^{(Z+1)})^{T}out^{(Z+1)} \cdot \sigma^{\prime}(D^{Z}) $$
(12)

To examine the impact of the weights and biases of any layer Z on ℓ, the gradients \(\frac {\partial \ell }{\partial W^{Z}}\) and \(\frac {\partial \ell }{\partial h_{bias}^{Z}} \) are calculated using (13) and (14), respectively.

$$ \frac{\partial \ell}{\partial W^{Z}} = H^{Z-1} out^{Z} $$
(13)
$$ \frac{\partial \ell}{\partial h_{bias}^{Z}} = out^{Z} $$
(14)

The gradients are further updated using different batches until the training examples are entirely fine-tuned. The weights and biases resulting from Algorithm 3 are then employed to optimize the FDBN model for better prediction.

Algorithm 3
figure c

Supervised fine-tuning(\(\widehat {T},L,N_{ls},\alpha _{fn},\ell \)).
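
A minimal NumPy sketch of one such backpropagation step over the stacked sigmoid layers, following Eqs. (11)–(14), is shown below (the batch, layer sizes, and learning rate are illustrative; mini-batching over \(\widehat{T}\) and the full Algorithm 3 loop are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def finetune_step(V, T, weights, biases, alpha=0.1):
    """One BP step over a batch (V, T) for a stack of sigmoid layers."""
    # Forward pass, caching activations H^(Z); H^(0) is the input V
    H, Hs = V, [V]
    for W, b in zip(weights, biases):
        H = sigmoid(H @ W + b)            # D^(Z) = W^Z H^(Z-1) + h_bias^Z
        Hs.append(H)

    N = V.shape[0]
    out = (Hs[-1] - T) * Hs[-1] * (1 - Hs[-1])            # Eq. (11), MSE + sigmoid'
    for z in reversed(range(len(weights))):
        grad_W = Hs[z].T @ out / N                        # Eq. (13)
        grad_b = out.mean(axis=0)                         # Eq. (14)
        if z > 0:                                         # Eq. (12): propagate backwards
            out = (out @ weights[z].T) * Hs[z] * (1 - Hs[z])
        weights[z] -= alpha * grad_W
        biases[z] -= alpha * grad_b
    return weights, biases

# Toy fine-tuning step: 32 samples, layers 50 -> 40 -> 20
V = rng.random((32, 50))
T = rng.random((32, 20))
weights = [0.01 * rng.standard_normal((50, 40)), 0.01 * rng.standard_normal((40, 20))]
biases = [np.zeros(40), np.zeros(20)]
weights, biases = finetune_step(V, T, weights, biases)
```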

4.1.2 Kernel-based feature extraction

In this stage, we seamlessly incorporate the hidden features learned by the generative FDBN model into kernel-based feature extraction using KPCA. Since the visible units of the FDBN model are not simple nodes with a single entry but are instead modeled as five nodes representing individual evaluations (1–5 rating scale), we accordingly convert each set of five binary latent feature nodes of the last hidden layer of the FDBN into categorical feature data. Then, KPCA maps these non-linear FDBN features \( \{x_{1},x_{2},\ldots,x_{X}\} \in \mathbb {R}^{M}\) to a high-dimensional space by transforming them into linearly separable factors F. The mapping projection function ϕ extends the data points into a feature space \(\mathbb {F}\) of higher dimension as follows:

$$ \phi : \mathbb{R}^{M} \rightarrow \mathbb{F}, x \rightarrow \phi(x) $$
(15)

Where {ϕ(x1), ϕ(x2), …, ϕ(xX)} are assumed to be centered at the origin of \(\mathbb {F}\), i.e., \({\sum }_{k=1}^{X} \phi (x_{k}) = 0\).

Given \(\overline {C}\), the covariance matrix in the resulting space \(\mathbb {F}\), KPCA satisfies the equation \( \nu \overline {V} = \overline {C} \overline {V} \) for positive eigenvalues ν and eigenvectors \(\overline {V} \in \mathbb {F}\setminus \{0\}\). \(\overline {C}\) is described by (16).

$$ {\overline{C}=\frac{1}{X}\sum\limits_{j=1}^{X} \phi(x_{j}) \phi(x_{j})^{T}} $$
(16)

One may note that all \(\overline {V}\) with ν≠ 0 lie in the subspace generated by {ϕ(x1), ϕ(x2), …, ϕ(xX)}. We may thence consider the following equivalent equations for all i = 1, …, X:

$$ {\phi(x_{i})\overline{C}\overline{V}=\nu(\phi(x_{i})\overline{V})} $$
(17)

and that there exist eigenvector coefficients eu (u = 1, …, X) such that:

$$ {\overline{V}=\sum\limits_{u=1}^{X} e_{u} \phi(x_{u})} $$
(18)

Given a kernel function K, to get eu, we define the X × X kernel matrix \(\overline {K}\) according to (19)

$$ {\overline{K}_{uj} := \langle\phi(x_{u}),\phi(x_{j})\rangle= (K(x_{u},x_{j}))_{uj}} $$
(19)

Then, substituting \(\overline {C}\), and \( \overline {V}\) into (17), we arrive at:

$$ {X\nu \overline{K}e= \overline{K}^{2}e} $$
(20)

Which is equivalent to solving \(X\nu e= \overline {K}e\) for nonzero ν [35].

It is noteworthy that, by employing the kernel functions \(K(x,x^{\prime })\) defined in Table 1, the dot products of the projected vectors ϕ(x) are computed without explicitly carrying out the map ϕ(⋅). Besides, the data points cannot be directly centered in \(\mathbb {F}\). Hence, the final definition of the kernel matrix of centered data can be described using the matrix \(\overline {K}\) as follows:

$$ { \widetilde{\overline{K}_{uj}} = \overline{K}-1_{X}\overline{K}-\overline{K}1_{X}+1_{X}\overline{K}1_{X}} $$
(21)

Where the matrix \((1_{X})_{uj} = \frac {1}{X}\); further details can be found in [35].

To extract the Principal Components (PCs) corresponding to a kernel function K, we first compute the matrix \(\overline {K}\), then calculate and normalize \(\overline {V}\) in the feature space \(\mathbb {F}\). The projection of a data point x onto \(\overline {V}^{i}\) in \(\mathbb {F}\) can be expressed by:

$$ {(KPC)_{i} (x)= \overline{V}^{i}\phi(x)} $$
(22)

Therefore, for the test point x, we can describe the ith feature representation value as follows:

$$ {F'_{i} = \overline{V}^{i}\phi(x) = \sum\limits_{u=1}^{X} {e_{u}^{i}}K(x,x_{u})} $$
(23)

As shown in Fig. 4, we use the output \(F^{\prime }\) features to create Users’ Profiles (UPs), which in turn generate novel Items’ Profiles (IPs). Here, we suppose that all users who have rated a particular item have common characteristics. Hence, IPs represent hidden features hc in items that may interest the users. In other words, an item ik is described by the merged UPs of the users who rated the item, which is beneficial in cold-start situations where items and users are new to the system and no initial rating or feedback is available [20]. Therefore, dimensionality reduction using PCA is applied to the new UP-based item profiles to boost computational efficiency and capture important information among items relying on UPs. PCA constructs the transformation matrix M that converts D = {hc1, hc2, …, hcD}, the original combined hidden features of the IPs in a q-dimensional space, into new reduced KFDBN features in a t-dimensional space, where t ≪ q. The Eigen Decomposition (ED) decomposes the Covariance Matrix \(\overline {C} (D.D^{T} )\) into three matrices as follows:

$$ D.D^{T} \rightarrow S.L.S^{T} $$
(24)

Where S is the matrix (q × q) that contains Eigenvectors \(\overline {V}\), which represent directions; and L is the diagonal matrix (q × q) with eigenvalues ν that represent data magnitude.

Fig. 4
figure 4

Kernel-based feature extraction

Algorithm 4
figure d

PCA(D,q,t).

One primary benefit of employing PCA on the UP-based item profiles lies in further discovering correlation features among co-rated items from users’ perspectives, thus capturing precise and relevant items that satisfy users’ needs. The impact of the resulting KFDBN features on item relevancy classification is examined in depth in the following sections.
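
A toy sketch of this profile-expansion step is given below. The aggregation of user profiles into an item profile (here, a simple mean over the profiles of the users who rated the item) is an assumption made for illustration, as are the dimensions and the random stand-in data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_users, n_items, n_up_feats = 100, 50, 20

UP = rng.standard_normal((n_users, n_up_feats))   # stand-in for KPCA-based user profiles
rated = rng.random((n_users, n_items)) < 0.05     # boolean mask of observed ratings

# Item profile: aggregate of the profiles of the users who rated the item
# (mean aggregation assumed); items with no raters keep a zero profile.
IP = np.zeros((n_items, n_up_feats))
for j in range(n_items):
    raters = np.where(rated[:, j])[0]
    if raters.size > 0:
        IP[j] = UP[raters].mean(axis=0)

# PCA keeps only the t most informative directions (t << q)
pca = PCA(n_components=5)
reduced_items = pca.fit_transform(IP)
print(reduced_items.shape)    # (50, 5)
```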

4.2 KFDBN item relevancy classification

To perform item relevancy classification, we emphasize powerful deep feature engineering by incorporating the hidden latent factors extracted by KFDBN into the classification task, unlike traditional classifiers that apply shallow techniques in the learning process [40, 44]. The purpose is to take advantage of the practical training procedure of the FDBN generative approach by using it as a base pre-model that produces input features, along with KPCA, to improve the classification results when facing sparsity issues. The KFDBN model is leveraged to create input vectors for kernel-based Support Vector Classification (SVC) in two prominent cases:

  • KPCA is applied to the features as a pre-processing stage to capture further meaningful hidden structures and enhance the item relevancy classification. Let \( \{x_{1},x_{2},\ldots,x_{X}\} \in \mathbb {R}^{M}\) be the training data with X observations of features resulting from the KFDBN model combined with other feature data related to ratings and items in a given space \(\mathbb {R}\) of dimension M. KPCA first maps X to a high-dimensional space using the equations defined in Section 4.1.2. The projection of the training data is then used as input to the SVC.

  • KFDBN output features are combined with other features and then directly incorporated to perform the kernel-based semi-supervised classification approach.

Let \(\{(y_{p},R_{p})\}_{p=1}^{Y}\) be the training set with input examples \(y_{p} \in Y \subset \mathbb {R}^{N}\) and corresponding labels \(R_{p} \in \{1,...,N_{c}\}_{p=1}^{Y}\) of N-dimensional feature vectors yp. Let Nc be the number of possible classes; we use the one-versus-one (OvO) approach, also referred to as pairwise classification [25], to introduce \(\frac {N_{c}(N_{c}-1)}{2}\) classifiers, one for each pair of classes. The following binary optimization problem is solved to train the classifier for the class pair separating the class of index r from that of index \(r^{\prime }\).

$$ \begin{array}{@{}rcl@{}} && min_{w^{rr^{\prime}},b^{rr^{\prime}},\xi^{rr^{\prime}}} \frac{1}{2} \parallel w^{rr^{\prime}} \parallel^{2} + C \sum\limits_{l} \xi_{l}^{rr^{\prime}} \end{array} $$
(25a)
$$ \begin{array}{@{}rcl@{}} &\text{subject to } &(w^{rr^{\prime}})^T\phi(y_l)+b^{rr^{\prime}}\geq1-\xi_l^{rr^{\prime}}, \text{ if } R_l=r \end{array} $$
(25b)
$$ \begin{array}{@{}rcl@{}} && (w^{rr^{\prime}})^T\phi(y_l)+b^{rr^{\prime}}\leq-1+\xi_l^{rr^{\prime}}, \text{ if } R_l=r^{\prime}\ \end{array} $$
(25c)
$$ \begin{array}{@{}rcl@{}} && \xi_l^{rr^{\prime}} \geq 0 \end{array} $$
(25d)

Where \(w^{rr^{\prime }}\) is a vector of dimension L (L ≥ N), C is the penalty parameter, ξl are the slack variables, and b denotes the bias term. The decision function for classes r and \(r^{\prime }\) is defined according to (26).

$$ \mathscr{F}_{r,r^{\prime}}(y)= (w^{rr^{\prime}})^{T}\phi(y)+b^{rr^{\prime}} $$
(26)

Here, ϕ(y) are mapped vectors in a feature space of dimension L and \({\mathscr{F}}_{r,r^{\prime }}(y)= - {\mathscr{F}}_{r^{\prime },r}(y)\). The voting strategy [15] is used to classify a new sample. Specifically, for the input instance y, we compute

$$ \mathscr{V}_{r}(y)= \sum\limits_{r^{\prime}\neq r, r^{\prime}=1}^{N_{c}} sign(\mathscr{F}_{rr^{\prime}}(y)) $$
(27)

Where

$$ sign(t)=\begin{cases}1, & \text{if } t> 0\\0, & \text{otherwise}\end{cases} $$
(28)

Namely, the binary classifier gives 1 as its vote to \({\mathscr{V}}_{r}(y)\) when the instance y is allocated to class r rather than class \(r^{\prime }\). Otherwise, it votes 0. \({\mathscr{V}}_{r}(y)\) is hence the sum of all the votes obtained from the classifiers determining that y is assigned to class r.

$$ r^{*}(y)= \operatorname*{arg max}_{r=1,...,N_{c}}\mathscr{V}_{r}(y) $$
(29)

Therefore, as described by (29), the sample y is classified into the relevancy class (relevant, neutral, and irrelevant) that receives the highest number of votes from the classifiers.

The employed multi-class support vector classification using the OvO technique is beneficial for kernel methods since it solves two-class binary problems instead of using the entire training data Nc times. The proposed classification approach thus allows robust and efficient classification of both small and large data. Moreover, the KFDBN classifier leverages a quadratic optimization that does not suffer from local optima and overcomes the overfitting problem by employing regularization principles, which naturally expedites the learning process.
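
For reference, scikit-learn’s SVC trains the \(\frac{N_{c}(N_{c}-1)}{2}\) one-vs-one binary classifiers internally, so a minimal version of this classification stage can be sketched as follows (the feature matrix, labels, kernel, and C are illustrative placeholders; in practice they come from the KFDBN features and a grid search):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((600, 30))       # stand-in for KFDBN feature vectors
y = rng.integers(0, 3, size=600)         # 0: irrelevant, 1: neutral, 2: relevant

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# SVC builds Nc(Nc-1)/2 binary classifiers (one-vs-one) and aggregates their votes
clf = SVC(kernel="rbf", C=1.0, gamma="scale", decision_function_shape="ovo")
clf.fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))
```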

4.3 KFDBN-SVD Top-k recommendation

Now that the KFDBN model has generated missing rating predictions using its FDBN generative model and has classified the relevant items using the classifier component, we exploit the KFDBN results to propose a new imputation technique for MF-based recommendation approaches. Specifically, we merge the FDBN’s high output predictions with the ratings of relevant items to create a denser user-item interaction matrix. We then apply SVD to the imputed rating matrix to generate Top-k recommendation lists of ranked items for the users.

The regularized SVD decomposes the rating matrix \(R \in \mathbb {R}^{n \times m}\) into two low-rank matrices, such that users are associated with feature vectors ai of \(A \in \mathbb {R}^{n \times p}\), and items are described by their latent features bj of \(B \in \mathbb {R}^{p \times m}\), where the rank p << min(n,m). Hence, the matrix R can be estimated using:

$$ R \approx A^{T} B $$
(30)

The rating record rij is factorized into user and item latent feature vectors by minimizing the following squared loss function:

$$ min_{A,B}\frac{1}{2}\sum\limits_{i=1}^{n}\sum\limits_{j=1}^{m}I_{ij}(r_{ij}-{a_{i}^{T}}b_{j})^{2}+\frac{\lambda_{A}}{2}\mid\mid A{\mid\mid_{F}^{2}}+\frac{\lambda_{B}}{2}\mid\mid B{\mid\mid_{F}^{2}} $$
(31)

Since most inputs in R are missing due to the sparsity, the indicator function Iij takes 1 if the user i has evaluated the item j and 0 otherwise. The regularization terms (\(\lambda _{A} = \frac {{\sigma _{A}^{2}}}{\sigma ^{2}},\lambda _{B} = \frac {{\sigma _{B}^{2}}}{\sigma ^{2}}\)) are incorporated into the objective function to make a trade-off between overfitting and the variance σ in approximating available entries. The optimum minimization with respect to the Frobenius norm ∣∣.∣∣F can be obtained by employing gradient descent in A and B.

Therefore, the low-rank linear approximation of the KFDBN-based imputed rating matrix ImpR can be estimated by solving the following optimization:

$$ min_{A,B} \triangle_{\omega} +\triangle_{\psi} +\frac{\lambda_{A}}{2} \mid\mid A{\mid\mid_{F}^{2}}+\frac{\lambda_{B}}{2} \mid\mid B{\mid\mid_{F}^{2}} $$
(32)

Where \(\triangle _{\omega } = \frac {1}{2}{\sum }_{i=1}^{n}{\sum }_{j=1}^{m}I_{\omega }(i,j)(r_{ij}-{a^{T}_{i}}b_{j})^{2} \) and \(\triangle _{\psi } = \frac {1}{2}{\sum }_{i=1}^{n}{\sum }_{j=1}^{m}I_{\psi }(i,j)(r_{ij}^{*}-{a^{T}_{i}}b_{j})^{2} \). Iω(i,j) is the indicator function that takes 1 if the entry rij is available and 0 otherwise. Iψ(i,j) is the observed indicator, which equals 1 if the imputed rating \(r_{ij}^{*}\) exists and 0 otherwise. ω and ψ are, respectively, the sets of known entries and KFDBN-based imputed ratings. Combining △ω and △ψ, the aim is to minimize the following equation:

$$ min_{A,B} \frac{1}{2}\sum\limits_{i=1}^{n}\sum\limits_{j=1}^{m}I(i,j)(\overbrace{r_{ij}}-{a^{T}_{i}}b_{j})^{2}+\frac{\lambda_{A}}{2} \mid\mid A{\mid\mid_{F}^{2}}+\frac{\lambda_{B}}{2} \mid\mid B{\mid\mid_{F}^{2}} $$
(33)

Where \(\overbrace {r_{ij}}=\begin {cases}r_{ij} & (i,j) \in \omega \\r_{ij}^{*} & (i,j) \in \psi \end {cases}\). Updating latent matrices of ImpR for users and items is performed as follows:

$$ \begin{array}{@{}rcl@{}} & a_i= a_i + \alpha_A(\iota_{ij}b_j-\lambda_Aa_i) & \end{array} $$
(34a)
$$ \begin{array}{@{}rcl@{}} & b_j= b_j + \alpha_B(\iota_{ij}a_i-\lambda_Bb_j) \end{array} $$
(34b)

Where αA, αB are the learning rates and \(\iota _{ij} = \overbrace {r_{ij}} - {a_{i}^{T}}b_{j}\) is the prediction error.
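
A minimal sketch of these stochastic updates over the union of observed and imputed entries is shown below (the latent rank, learning rates, regularization strengths, and the randomly generated triples are illustrative placeholders for \(\overbrace{r_{ij}}\)):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 50, 40, 8
A = 0.1 * rng.standard_normal((n, p))    # user latent factors a_i
B = 0.1 * rng.standard_normal((m, p))    # item latent factors b_j

# (i, j, r) triples standing in for the union of observed ratings (omega)
# and KFDBN-imputed ratings (psi)
triples = [(i, j, float(rng.integers(1, 6)))
           for i in range(n) for j in range(m) if rng.random() < 0.1]

alpha_A = alpha_B = 0.01
lam_A = lam_B = 0.05

for epoch in range(20):
    for i, j, r in triples:
        err = r - A[i] @ B[j]                            # prediction error iota_ij
        a_old = A[i].copy()
        A[i] += alpha_A * (err * B[j] - lam_A * A[i])    # Eq. (34a)
        B[j] += alpha_B * (err * a_old - lam_B * B[j])   # Eq. (34b)
```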

Finally, based on the extracted correlation matrices, we generate the Top-k item recommendations for users. The proposed KFDBN-based imputation approach is particularly useful in solving the sparsity hurdle in SVD while leveraging its advantages from complexity reduction to performance improvement [30].

5 Experiments and evaluation

To fully evaluate the effectiveness of the proposed KFDBN model, we perform extensive experiments on two main recommendation tasks: missing rating prediction and Top-k ranking recommendation. We first present each task’s experimental setup, including evaluation metrics, baseline approaches, and parameter settings. Next, we analyze and discuss the experimental findings, aiming to answer the following Research Questions (RQs):

  1. RQ1. What is the performance of the RBM model in rating prediction?

  2. RQ2. Does going deeper using the RBM lead to better accuracy?

  3. RQ3. Does the FDBN outperform the state-of-the-art approaches in missing rating prediction?

  4. RQ4. Do KFDBN features yield better accuracy in the KFDBN classifier?

  5. RQ5. What is the impact of kernel Dimensionality Reduction on the KFDBN classifier performance?

  6. RQ6. Does the KFDBN-based imputation technique outperform the state-of-the-art approaches in the Top-k recommendation?

The experiments are conducted using six real datasets from various domains and characterized by different sparsity levels. Table 2 summarizes the statistics of each dataset. One may note that the experiments are performed both on large-scale and small datasets to examine the effectiveness of the proposed approach for different data sizes. The MovieLens datasets, collected by GroupLens over several periods, describe rating transactions from the MovieLens website. FilmTrust was crawled from a website where users provide feedback ratings on recommended movies. Amazon Digital Music, Automotive, and Instant Video reflect users’ interests in items of several categories and industries on the Amazon platform. The sparsity reported in Table 2 is computed as \(1-\frac {N_{R}}{N_{U} \times N_{I}}\), where NR, NU, and NI are the numbers of ratings, users, and items, respectively.

Table 2 Description of evaluation datasets

We investigate the sensitivity of the chief parameters and settings of the proposed model and benchmark approaches for each recommendation task. Table 3 presents these hyper-parameters, their definitions, and their impact on the performance of solutions.

Table 3 Hyper-parameters of the proposed KFDBN and benchmark methods

5.1 Evaluation for missing rating prediction

5.1.1 Experiment settings

Evaluation metrics

We use two widely employed metrics, Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE), to calculate the average error in the rating predictions. MAE and RMSE indicate, as presented in (35) and (36), the extent to which the proposed model can generate relevant predictions close to the ratings provided by users.

$$ MAE = \frac{1}{\mid{N\mid}}{\sum}_{n_{k}\in N}\mid\widehat{R}(n_{k}) - {R}(n_{k})\mid $$
(35)

Due to the linearity of MAE, the average variations are weighted equally, whereas the RMSE measure emphasizes the most substantial deviations by squaring the errors before averaging them.

$$ RMSE =\scriptstyle \sqrt{\frac{1}{\mid{N\mid}}{\sum}_{n_{k}\in N}(\widehat{R}(n_{k}) - {R}(n_{k}))^{2}} $$
(36)

Where N, \(\widehat {R}(n_{k})\), and R(nk) are respectively the number of assessed items, the predicted rating and the ground truth rating.
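
As a quick reference, both metrics can be computed in a few lines (toy predicted and ground-truth vectors are used here for illustration):

```python
import numpy as np

def mae(pred, truth):
    return np.mean(np.abs(pred - truth))

def rmse(pred, truth):
    return np.sqrt(np.mean((pred - truth) ** 2))

pred = np.array([3.8, 2.1, 4.5, 1.2])
truth = np.array([4.0, 2.0, 5.0, 1.0])
print(mae(pred, truth), rmse(pred, truth))   # MAE weights errors equally; RMSE squares them
```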

Baselines

To assess the performance of the proposed generative model in missing rating prediction, we compare it with the following state-the-art methods on different recommendation domains:

  • Non-Negative Matrix factorization (NMF) [12]: MF method with a non-negative probabilistic constraint to select similar user groups and explain CF recommendations.

  • Neural Collaborative Filtering (NCF) [11]: A state-of-the-art NN-based approach that models hidden user-item interaction using a standard MLP.

  • SVD++ [26]: SVD extension that integrates implicit records into the recommendation model to uncover latent factors that characterize dense ratings.

  • A hybrid Deep Learning-based Recommender System (DNNREC) [24]: DNN-based approach that learns non-linear latent features using embedding and auxiliary information about users and items.

  • Joint Representation Learning with Ratings and Reviews (JRLRR) [43]: A deep hybrid method that fuses rating embedding and textual item features based on Gated Recurrent Unit and attention mechanisms for prediction.

  • Multi-Objective Evolutionary Algorithm (MOEA) [32]: A multi-objective method to find the appropriate set of similar neighbors for an active user in CF recommendation.

Parameter settings

To train the RBM, we use a momentum mo of 0.1 for the six datasets. We set α to 0.0001 for Digital Music (DG), Instant Video (AIV), and Automotive (Auto), to 0.0005 for both MovieLens 100K (ml-100k) and MovieLens 1M (ml-1m), and to 0.09 for the FilmTrust dataset. A batch size BS of 32 performed well on all datasets. Several experiments are set up to examine the validity of the hyper-parameters and select their optimal values for the FDBN model. Based on the validation outcomes, D is set to 2000 latent features for the first hidden layer and 1000 latent features for the second hidden layer, while the number of propagated training samples BS is set to 32 on all datasets. α is set to 0.9 on all datasets except AIV, where α = 0.2 shows better performance. With unsupervised pre-training, mo is set to 0.1 on all datasets to prevent the model from getting stuck in local minima. For the benchmark models, the parameters are determined from the published ones and by a grid search, which optimizes the approaches to determine the hyper-parameters that result in the most precise recommendation. The optimal values for each method according to recommendation domains and datasets are reported in Tables 4 and 5.

Table 4 Hyper-parameters of benchmark models in movie recommendation domain: Rating Prediction task
Table 5 Hyper-parameters of benchmark models in music, video, and automotive recommendation domains: Rating Prediction task

5.1.2 The performance of the RBM model (RQ1)

Figures 5 and 6 show the sensitivity of the dimension of hidden features on the accuracy of the RBM model. The RMSE results of the RBM model are presented in Fig. 5, while the MAE results are reported in Fig. 6. The best accuracy on all datasets is achieved when setting D to 5000. By increasing the dimension of latent features, RBM leads to better performance, which indicates that the model can extract more reliable information.

Fig. 5
figure 5

RMSE Results of the RBM model (the lowest is better) according to the dimension of latent features

Fig. 6
figure 6

MAE Results of the RBM model (the lowest is better) according to the dimension of latent features

Figure 7 depicts the training and testing times of the RBM on all datasets. It has been observed that using a smaller batch size leads to better accuracy but slows down the convergence of the learning process. Likewise, a lower learning rate allows the model to find a more optimal set of weights but takes longer to train.

Fig. 7
figure 7

Time complexity of the RBM model in both training and testing stages in seconds based on the dimension of latent features

5.1.3 The performance of the DBN model (RQ2)

To illustrate the performance of the DBN model before and after fine-tuning, Table 6 reports the RMSE and MAE results on the six datasets. We can observe that the RBM model outperforms the DBN before fine-tuning on four out of the six datasets (ml-1m, FilmTrust, AIV, and Auto), while the DBN before fine-tuning outperforms the RBM on ml-100k and DG. Nevertheless, after fine-tuning, the DBN model yields more accurate predictive results.

Table 6 Comparison of prediction accuracy of the RBM, DBN, and FDBN (fine-tuned DBN)

The best RMSE and MAE values on all datasets are achieved using the FDBN, which indicates the significance of the supervised fine-tuning phase. The fine-tuning procedure is conducted through BP to obtain an optimal model. The goal here is not to discover new latent features but to search locally for relevant parameters using the hidden features. For further accuracy, the optimization method updates the weights, as depicted in Fig. 8. The weight distribution of RBM-0 is presented using Matplotlib, while the weight distribution of RBM-1 is presented using Tensorboard. These adjusted weights result from the fine-tuning stage using BP, which leverages the valuable information in the labels. The model’s accuracy is hence increased after seeking the optimal values of the RBMs’ weights, which produces the efficient FDBN generative model.

Fig. 8
figure 8

Weights distribution of the DBN before and after fine-tuning on Movielens 100K

5.1.4 Overall performance of the FDBN model in missing rating prediction (RQ3)

We compare the prediction errors of the FDBN and benchmark models using the six datasets. Table 7 presents RMSE and MAE error values in the movie recommendation field, while Table 8 reports RMSEs and MAEs of the methods in the music, video, and automotive recommendation domains.

Table 7 FDBN performance compared with benchmark approaches in the movie recommendation field
Table 8 FDBN performance compared with benchmark approaches in the music, video, and automotive recommendation fields

The FDBN model immensely outperforms all the state-of-the-art approaches on the six real datasets in the various recommendation fields. Specifically, on ml-100k, ml-1m, and FilmTrust, the RMSE values of the FDBN are, respectively, 0.6382, 0.6266, and 0.6989, which are 22.6%, 24.84%, and 10.61% lower in error compared to the most competitive solution, i.e., DNNREC. The FDBN also outperforms DNNREC in MAE by 17.74%, 19.14%, and 9.59% on ml-100k, ml-1m, and FilmTrust, respectively. DNNREC employs side information as input to model users’ and items’ feature vectors, operating different embedding layers. However, the learned latent factors are simply merged and fed to an MLP for missing rating prediction. Each feature vector is directly exploited for representation, which can overfit the training data. This fails to ensure an accurate recommendation, especially when the model needs large amounts of content data to learn significant hidden factors. Nevertheless, the efficient greedy layer-wise training of the FDBN not only yields relevant rating predictions without relying on auxiliary or implicit information but also enhances generalization to new data, thereby avoiding the overfitting problem that may occur in DNNREC.

On the Amazon datasets, SVD++ and JRLRR show good results but fail to surpass the FDBN in RMSE and MAE. On Amazon DG, the RMSE value of the FDBN is 0.6673, which is 22.1% lower than the second-best solution, i.e., JRLRR. The latter relies substantially on auxiliary data, which is not always effortlessly obtained, especially when privacy constraints apply. Nevertheless, the FDBN copes with the co-rated items limitation by incorporating all available interactions of users with items in the learning process without relying on side information. On the AIV dataset, the FDBN achieves an RMSE of 0.6789 and outperforms the MF-based SVD++ technique with a 26.3% improvement. This demonstrates that the generative FDBN is better than MF-based techniques (NMF, SVD++), which may lose valuable information from the original matrix; in contrast, the FDBN learns relevant latent factors and valuable hidden features based on probabilistic building blocks, where neural units merely communicate their activation states.

Unlike NCF, which employs a simple MLP for feature learning, the FDBN leverages a more powerful deep generative model to capture higher-level latent feature interactions in the sparse user-item matrix. The FDBN also significantly outperforms neighborhood-based techniques (e.g., MOEA) thanks to its effective learning of complex hidden features. MOEA uses the Non-dominated Sorting Genetic Algorithm II (NSGA-II) to determine each user's ideal number of similar neighbors. This genetic technique relies on a crowding-distance comparison that ignores the distribution of solutions, resulting in a biased search that tends to fall into local optima and converge poorly [39]. In contrast, the FDBN's greedy unsupervised pre-training provides a regularization effect by efficiently initializing the chief parameters. This enhances generalization and directs the model's weights toward a good local minimum, which explains why the FDBN vastly exceeds the baseline approaches and achieves the best RMSE and MAE values.

5.2 Evaluation for Top-k ranking recommendation

5.2.1 Experiment settings

Evaluation metrics

We use the Hit Rate (HR) and the Normalized Discounted Cumulative Gain (NDCG) metrics to assess the performance of the proposed KFDBN model in providing an effective ranked list of Top-k items. HR@k examines the presence of the test item in the recommended Top-k list, while NDCG@k evaluates the ranking quality by accounting for the position of the hits. HR and NDCG at a cutoff k are defined in (37) and (38).

$$ HR@k = \frac{N\_hits@k}{N\_users} $$
(37)

where N_hits@k is the number of users for whom the held-out test item appears in the Top-k recommendation list, and N_users denotes the number of users in the test set.

$$ NDCG@k = \frac{1}{IDCG@k} \times\sum\limits_{i=1}^{k}\frac{2^{rel(i)}-1}{\log_{2} (i+1)} $$
(38)

NDCG discounts hits at lower positions by a log2 factor, thereby rewarding hits at top ranks; rel(i) is the relevance (rating) of the item at position i.

$$ IDCG@k= \sum\limits_{i=1}^{\mid REL_{k} \mid} \frac{2^{rel(i)}-1}{\log_{2} (i+1)} $$
(39)

To provide a normalization factor, the IDCG (Ideal Discounted Cumulative Gain) computes the Discounted Cumulative Gain (DCG) of the ideal ranking ordered by relevance; ∣REL_k∣ denotes the size of the ideal list of relevant items, sorted by relevance, up to position k.
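As a concrete illustration, the sketch below computes HR@k and NDCG@k assuming a leave-one-out protocol with a single held-out test item per user (as the definition of N_hits@k suggests), so that IDCG@k = 1 and NDCG reduces to a log2 discount of the hit position. The dictionary-based inputs are illustrative placeholders, not the actual evaluation code.

```python
# Hedged sketch of HR@k and NDCG@k, assuming one held-out test item per user.
import numpy as np

def hit_rate_at_k(ranked_lists, test_items, k=10):
    """HR@k: fraction of users whose held-out item appears in their Top-k list."""
    hits = sum(test_items[u] in ranked_lists[u][:k] for u in test_items)
    return hits / len(test_items)

def ndcg_at_k(ranked_lists, test_items, k=10):
    """NDCG@k with binary relevance: one relevant item per user, so IDCG@k = 1
    and the score is 1 / log2(position + 1) for a hit, 0 otherwise."""
    scores = []
    for u, item in test_items.items():
        topk = ranked_lists[u][:k]
        if item in topk:
            pos = topk.index(item) + 1              # 1-based rank of the hit
            scores.append(1.0 / np.log2(pos + 1))
        else:
            scores.append(0.0)
    return float(np.mean(scores))

# Toy usage: user 0's held-out item is ranked 2nd, user 1's item is not in the list
ranked_lists = {0: [5, 7, 9, 1], 1: [3, 4, 8, 2]}
test_items = {0: 7, 1: 6}
print(hit_rate_at_k(ranked_lists, test_items, k=4))   # 0.5
print(ndcg_at_k(ranked_lists, test_items, k=4))       # ~0.315
```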

Since the KFDBN-based imputation technique proposed for the Top-k recommendation is a hybrid approach that relies on the results of the KFDBN classifier, we define in (40) the accuracy metric to evaluate the performance of the classification model.

$$ Accuracy = \frac{TP + TN}{TP+TN+FP+FN} $$
(40)

The accuracy measures the proportion of correct predictions made by the classifier out of all instances in the dataset. TP (True Positive) denotes a correct positive prediction, where the prediction is positive and the real value is indeed positive. TN (True Negative) denotes a correct negative prediction, where both the prediction and the real value are negative. FP (False Positive) indicates an incorrect positive prediction, where the prediction is positive but the actual value is negative. FN (False Negative) denotes a wrong negative prediction, where the prediction is negative but the true value is positive.
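For completeness, a minimal check of (40) on placeholder labels, using scikit-learn's confusion matrix (whose binary layout unravels as tn, fp, fn, tp):

```python
# Minimal sketch of (40): accuracy from the confusion counts on toy labels.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # placeholder relevancy labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # placeholder classifier outputs
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)                      # 0.75 for this toy example
```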

Baselines

We compare the KFDBN with the following competitive approaches on the Top-k ranking task:

  • Bayesian Personalised Ranking (BPR) [34]: MF-based approach that exploits pairwise learning for personalized ranking.

  • Deep Matrix Factorization (DMF) [47]: A DL-based approach that maps non-linear projections of users and items into a common low-dimensional feature space with a binary-based cross-entropy loss function.

  • Neural Matrix Factorization (NeuMF) [11]: A model that fuses generalized matrix factorization (GMF) and MLP. GMF allows the application of a linear kernel that learns latent low-rank features, while the MLP employs a non-linear kernel to extract latent feature interactions.

  • Deep Collaborative Filtering (DeepCF) [8]: A representation-learning approach that fuses CF models into the Collaborative Filtering Network (CFNet) to model a complex matching function while retaining the ability to learn low-rank user-item interactions effectively.

  • Neural Graph Collaborative Filtering (NGCF) [41]: It exploits the high-order connectivity in the bipartite graph of user-item transactions to inject a collaborative signal into an embedding propagation layer.

Parameter settings

For baseline approaches, the optimal values of hyper-parameters on the Top-k recommendation task are reported in Tables 9 and 10.

Table 9 Hyper-parameters of benchmark models in movie recommendation domain: Top-k recommendation task
Table 10 Hyper-parameters of benchmark models in music, video, and automotive recommendation domains: Top-k recommendation task

5.2.2 Impact of KFDBN features on the classification performance (RQ4)

We perform the item relevancy classification using three different types of SVC kernels (RBF, Sigmoid, and Polynomial) with a regularization parameter C = 1. To examine the impact of the KFDBN output features on the model performance, we conduct the classification on the ml-100k and ml-1m datasets using: (1) rating features, such as the average rating and rating count, combined with item features, such as the year and genre; (2) rating features combined with KFDBN features; (3) rating features combined with item features and KFDBN features. Classification results in the three cases using the different SVC kernels are reported in Fig. 9.

Fig. 9 Classification accuracy with and without KFDBN features using different SVC kernels

The obtained classification outcomes show that the RBF kernel outperformed the other kernel functions in SVC on all datasets, whereas the Sigmoid function provided the lowest accuracies. With a classification accuracy of 97.62%, the SVC with RBF kernel trained on ml-100k rating features combined with KFDBN features achieved the best performance. It is worth highlighting that combining KFDBN features with rating information led to better accuracy for all SVC kernels on ml-100k, followed by the cases where KFDBN features are merged with item and rating features for item relevancy classification. On the ml-1m dataset, the SVC with RBF kernel yielded the highest accuracy of 98.38% when applied to KFDBN features merged with rating and item features. Moreover, KFDBN features improved the classification results of the SVC with the other kernels, especially when they were incorporated solely with rating information. In contrast, when only item features were combined with rating factors, the classification showed inferior results on both datasets. When PCA was not applied to the UP-based item profiles, the classification accuracy decreased by 17% for ml-100k and 16% for ml-1m. This further proves the significance of the KFDBN model in extracting meaningful latent features that can efficiently train the SVC for a classification problem.
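A minimal sketch of this comparison is given below, assuming scikit-learn's SVC; the three feature blocks (`rating_feats`, `item_feats`, `kfdbn_feats`) and the relevancy labels are random placeholders standing in for the average-rating/rating-count features, the year/genre attributes, and the KFDBN latent factors.

```python
# Hedged sketch: SVC with RBF, Sigmoid, and Polynomial kernels (C = 1) trained on
# the three feature combinations; all inputs are random placeholders.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 1000
rating_feats = rng.normal(size=(n, 2))      # e.g., average rating, rating count
item_feats   = rng.normal(size=(n, 20))     # e.g., year, one-hot genres
kfdbn_feats  = rng.normal(size=(n, 32))     # KFDBN latent factors (placeholder)
labels       = rng.integers(0, 2, size=n)   # item relevancy labels (placeholder)

feature_sets = {
    "ratings + items":         np.hstack([rating_feats, item_feats]),
    "ratings + KFDBN":         np.hstack([rating_feats, kfdbn_feats]),
    "ratings + items + KFDBN": np.hstack([rating_feats, item_feats, kfdbn_feats]),
}
for name, X in feature_sets.items():
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)
    for kernel in ("rbf", "sigmoid", "poly"):
        acc = SVC(kernel=kernel, C=1).fit(X_tr, y_tr).score(X_te, y_te)
        print(f"{name:>25s} | {kernel:7s} | accuracy = {acc:.4f}")
```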

5.2.3 Influence of kernel dimensionality reduction in KFDBN classification (RQ5)

We show the impact of the KPCA employed in the KFDBN item relevancy classifier by comparing the classification results obtained after applying kernel PCA with the SVC accuracies reported in Fig. 9. Classification outcomes before and after the application of KPCA on the ml-100k and ml-1m datasets are presented in Figs. 10 and 11, respectively. We apply KPCA to the rating features, item characteristics, and KFDBN latent factors used by the SVC model, and examine four kernels (RBF, Sigmoid, Polynomial, and Cosine).

Fig. 10 Classification accuracy of SVC with and without the application of KPCA on trained features: ml-100k dataset

Fig. 11 Classification accuracy of SVC with and without the application of KPCA on trained features: ml-1m dataset

For ml-100k, the best accuracy of 98.61% is obtained when applying Sigmoid PCA to the rating features combined with the KFDBN latent factors used to train the SVC classifier with RBF kernel. Similarly, for ml-1m, applying Sigmoid PCA to the rating features merged with the KFDBN and item factors achieves the highest accuracy of 98.83%.

Note that, when KFDBN features are included among the training factors, Sigmoid PCA leads to higher accuracy for the SVC with RBF kernel, thus outperforming the other classification results, whereas KPCA with the other kernels hampers the performance of the SVC regardless of its kernel. In contrast, when only rating and item factors are included, reducing the feature dimensionality with KPCA yields relatively lower SVC accuracy for all kernels and all datasets. This demonstrates the efficiency of Sigmoid PCA in extracting relevant principal components from the latent features obtained by the KFDBN model, thereby enhancing the semi-supervised classification results. Conversely, applying RBF, Polynomial, and Cosine kernels in PCA adversely affects the SVC classification by distorting the features, since some informative dimensions are discarded.
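The best-performing configuration above (Sigmoid KPCA followed by an RBF SVC) can be expressed as a simple scikit-learn pipeline; the sketch below uses random placeholder features and labels rather than the actual ml-100k/ml-1m inputs, and the number of components is illustrative.

```python
# Hedged sketch: Sigmoid kernel PCA for dimensionality reduction, then an RBF SVC.
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 34))             # placeholder: rating + KFDBN features
y = rng.integers(0, 2, size=1000)           # placeholder relevancy labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = make_pipeline(
    KernelPCA(n_components=16, kernel="sigmoid"),   # kernel dimensionality reduction
    SVC(kernel="rbf", C=1),                         # item relevancy classifier
)
acc = clf.fit(X_tr, y_tr).score(X_te, y_te)
print(f"accuracy with Sigmoid KPCA + RBF SVC: {acc:.4f}")
```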

5.2.4 Overall performance of the KFDBN model in the Top-k ranking recommendation (RQ6)

The HR and NDCG values of the proposed KFDBN model and the baseline methods on the six datasets are summarized in Tables 11 and 12. The highest score in each column is shown in bold, the underlined value is the second-best, and △ represents KFDBN's relative improvement over the best competing solution. The experimental results demonstrate that the hybrid KFDBN model achieves the most promising ranking performance on all datasets, meaningfully outperforming the baseline methods, including DMF, by the largest margin of 17% in NDCG@10 on ml-100k. We believe this is because DMF learns a latent space that captures users' transactions on items through an inner product applied directly to the sparse rating matrix, which is inadequate for complex real-world data. In contrast, by using a generated denser rating matrix, the hybrid KFDBN model effectively determines item relevance for personalized ranking. Moreover, the cosine measure employed by DMF may not account for the varied rating scales used by a particular user [17], further hindering its efficiency in the Top-k recommendation.

Table 11 KFDBN ranking performance compared with benchmark approaches in the movie recommendation field
Table 12 KFDBN ranking performance compared with benchmark approaches in the music, video, and automotive recommendation fields

On the ml-1m dataset, the HR@10 value of KFDBN is 0.7440, surpassing the second-best approach, i.e., DeepCF (0.7253). The latter merges DMF and MLP for enhanced user-item matching score prediction and reaches results closest to NeuMF. Nevertheless, despite their efficiency, the NeuMF and DeepCF models show inferior results compared to the proposed KFDBN model. Two factors hinder their performance: the dual embedding spaces in NeuMF, which may lead to overfitting, and the reliance on a simple MLP for feature extraction in DeepCF. In contrast, the KFDBN leverages more powerful DL-based feature interaction learning rather than using an MLP as a mapping function to capture complex low-rank features. Furthermore, KFDBN employs a greedy unsupervised pre-training phase to effectively initialize the model's parameters, which enhances the regularization effect, addresses the gaps in training state-of-the-art DNN approaches, and makes the proposed model less prone to overfitting.

Otherwise, one major limitation with graph-based models (e.g., NGCF) is the potential of exploring the bipartite graph structure of irrelevant user-item transactions, which encodes a poor collaborative signal in the embedding function. Such a limitation explains why KFDBN largely outperforms NGCF in item ranking performance and demonstrates that imputing a denser rating matrix using KFDBN is more accurate in user interest detection than propagating embeddings on the user-item bipartite graph structure.

Besides, DL-based models do not always outperform non-deep methods. Regarding HR and NDCG, BPR performs well on the AIV and Auto datasets. Nevertheless, on AIV, the HR@10 and NDCG@10 values of KFDBN are 0.6571 and 0.4458, which are 6.67% and 5.09% higher than the second-best solution, i.e., BPR. On the Auto dataset, the KFDBN surpasses the competitive solution, BPR, by 8.47% and 4.67% in HR@10 and NDCG@10, respectively. This demonstrates that the KFDBN hybridization exceeds Bayesian ranking in capturing complex correlations between users and items.

Moreover, instead of relying on the standard uniform negative sampler used in BPR, which treats unavailable interactions as indicators of users' potential negative preferences and assumes unobserved items to be negative instances, KFDBN aims for a reliable imputation of unobserved ratings in the original sparse matrix and then applies MF on the denser generated matrix to learn significant latent features of users and items. This yields highly encouraging results and justifies KFDBN's meaningful outperformance over the baselines in the Top-k recommendation.
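The imputation-then-factorization step can be sketched as follows, under stated assumptions: `predict_rating` is a stand-in for the hybrid model's predictions, the rating matrix is a small random toy example, and plain truncated SVD replaces the tuned MF component. The sketch only illustrates the pipeline of densifying the matrix and then ranking unseen items from the reconstructed scores.

```python
# Hedged sketch of imputation followed by MF-based Top-k ranking on toy data.
import numpy as np

rng = np.random.default_rng(0)
R = rng.integers(1, 6, size=(50, 200)).astype(float)    # toy ratings in [1, 5]
R[rng.random(R.shape) < 0.9] = 0.0                      # ~90% sparse, 0 = missing

def predict_rating(u, i):
    """Stand-in for the hybrid generative/classifier predictions (placeholder)."""
    return 3.0

R_dense = R.copy()
missing = (R_dense == 0)
R_dense[missing] = [predict_rating(u, i) for u, i in zip(*np.where(missing))]

# Rank-f truncated SVD on the densified matrix yields user/item latent factors
f = 10
U, s, Vt = np.linalg.svd(R_dense, full_matrices=False)
scores = (U[:, :f] * s[:f]) @ Vt[:f, :]                 # reconstructed preference scores

k, user = 10, 0
unseen = np.where(R[user] == 0)[0]                      # rank only items the user has not rated
top_k = unseen[np.argsort(-scores[user, unseen])][:k]
print("Top-k items for user 0:", top_k)
```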

5.3 Implications of findings

The proposed model fully exploits the DBN's advantages to present a deep structure learning architecture that discovers multiple high-level feature representations for users and items. The unsupervised pre-training avoids the pitfalls of traditional random weight initialization, which may drive the model toward poor local optima or increase training complexity. The KFDBN achieves better generalization by providing a greedy layer-by-layer learning procedure of reliable weights. This stage establishes a starting point for further supervised fine-tuning by placing the model's parameters in an appropriate restricted range. Leveraging such a semi-supervised synergy yields a regularizing effect that renders the learning strategy more efficient, especially in sparse conditions, as evidenced by the extensive experiments, which showed the significant improvement of the FDBN in terms of RMSE and MAE.

Unlike traditional classification-based CF approaches, which merely employ shallow feature extraction [40, 44], the proposed KFDBN generates deep and meaningful latent representations for users and items, leading to higher SVC-based classification outcomes. The unsupervised procedure of KFDBN indirectly affects the classifier's performance by providing a reliable initialization of weights; the supervised fine-tuning phase then helps optimize the model further by searching locally for optimal parameters. The SVC can therefore benefit from these stages by adopting the output KFDBN features to conduct an efficient semi-supervised classification. Moreover, the experimental results showed that KFDBN features can enhance the accuracy of supervised classifiers, especially when Sigmoid KPCA is further leveraged. The sigmoid extraction of the features learned by the KFDBN model is thus a promising approach that consistently yields accurate and practical recommendations based on semi-supervised classification.

Some state-of-the-art imputation-based techniques replace missing ratings with plausible values; the column mean or the row mean of the rating matrix may, for instance, be substituted for an empty entry [40]. Such an imputed value is not derived from users' interests, which alters the data's covariance structure and thereby impedes system performance. On the other hand, several ranking-based benchmark methods [8, 11, 34] imply a significant modification of the experimental evaluation by treating unobserved ratings as negative instances to compensate for the skewed distribution of known interactions. In contrast, the proposed model exploits a more powerful architecture for imputation to infer the unavailable ratings in the original sparse matrix, thus overcoming the hindrance of missing values on MF-based ranking performance. We showed that by employing SVD on a denser matrix imputed using KFDBN, the model captures more significant latent features, yielding promising prediction results for the Top-k recommendation. The proposed unified model exploits local and global correlations in the data for imputation, and an MF technique can then be leveraged directly to make relevant predictions on the reduced rating matrix, making this hybrid approach appropriate for high-dimensional and heterogeneous datasets while exhibiting more meaningful accuracy than traditional imputation-based techniques.

One of the significant issues imposed by sparsity is the inability to generate reliable predictions for long-tail items that have only a few ratings [27]. This challenge occurs when the recommendation approach depends on a single latent representation that exploits the ratings concentrated on a few popular items. In contrast, the proposed imputation approach can provide users with personalized and accurate recommendations from the items available in the long tail by hybridizing deep and kernel-based methods to learn latent features of users and items stemming from various aspects. These latent factors comprehensively reflect users' interests and items' characteristics for effective missing rating prediction, resulting in better novelty and diversity [17].

6 Conclusion

The proposed KFDBN first learns a deep generative model by leveraging a greedy layer-by-layer learning procedure for efficient feature extraction and missing rating prediction. It then creates input vectors for kernel-based discriminative support vector classification, performed in two stages to explore the impact of the KFDBN output features on kernel-based classification. This work intensively investigates the influence of kernel-based feature extraction on the system's performance. Experiments show that a further Sigmoid extraction applied to the KFDBN output features using Kernel Principal Component Analysis yields accurate and practical recommendations based on semi-supervised classification. Moreover, the positive predictions derived from the KFDBN's generative model are combined with the ratings of relevant items resulting from the KFDBN kernel-based classifier to impute a denser user-item interaction matrix for effective Matrix Factorization-based Top-k recommendation. Empirical evaluations on six datasets with varying sparsity levels demonstrate that the KFDBN achieves higher accuracy and outperforms state-of-the-art models. Future perspectives include extending the KFDBN to meta-learning by inferring powerful generalizations from meta rating interactions to model preference learning for cold-start users and items.