1 Introduction

Educational data mining (EDM) can be described as the set of processes designed to analyze data from educational settings in order to better understand students and the environment in which they learn [1]. An EDM module [2] tries to understand the learning behavior of students and to predict the status of a student's performance in the near future [3]. EDM learns the study pattern of students by analyzing their performance in past years. The data available in college records is large, and hence data mining schemes must be adapted to big data in order to build an efficient EDM system [4]. Data mining applied to big data in EDM offers various suggestions to the user regarding academics, future planning, policy framing, etc. [5].

The prediction schemes rely on the information available in educational institutions, which hold collections of student details, such as personal details, educational details, and other curricular activities. The information in the database can be subjected to prediction using standard data mining schemes, such as classification, clustering, statistics, visualization, etc. The clustering techniques suggested in EDM, however, only try to group the information in the data. Predicting the performance of students from the data available in universities has faced considerable challenges in recent years. Organizations use the predicted information for pre-planning the academic activities of students; hence, a prediction scheme helps the student to achieve a better result at the end of the course. Some works treat prediction as predicting the student's performance in upcoming semesters based on the student's past performance, which may act as an early warning system for both the learners and the educators [6]. It also enables educators to design corrective measures or extra coaching for underperforming students [7, 8].

The literature has suggested clustering-based schemes for prediction, since the application of clustering in fields such as bioinformatics and image analysis has produced significant results. A clustering scheme divides the database into various groups by calculating similarity measures between the data points in the database. One of the major challenges in clustering is dividing a database that contains vagueness [9]. The work in [10] used fuzzy rule-based classifiers (FRBCs) for constructing the prediction system for the students, and [11] used particle swarm optimization (PSO) for selecting suitable centroids for clustering. Moreover, existing works have suggested prediction schemes based on neural networks (NN) [12], decision trees, support vector machines (SVM) [13], and Naive Bayes [14] suitable for EDM [15]. Several works [16] suggested using the academic performance of students in high school to predict their future performance. This can be exploited by the prediction scheme to identify the study pattern of the students.

This research proposes a prediction scheme for identifying the student's future performance. The database required for the prediction is collected in real time and comprises individual, environmental, family, and other information. The research introduces a cluster-based distributed architecture for predicting the performance of the students. Because the database may contain missing information, the missing data is imputed with the clustering-based neural network defined in [17]. The architecture is provided with a set of computers for processing, and it processes the data clustered through Bayesian fuzzy clustering (BFC). For every cluster computer, kernel-based principal component analysis (KPCA) performs feature dimension reduction and provides the result to the training algorithm. Once the best features are identified from the data, the prediction is performed using the proposed Lion–Wolf deep belief network. Here, the Lion–Wolf algorithm developed in the previous paper [18] is employed in the deep belief network (DBN) for the training process. Finally, the prediction results are merged in the final cluster nodes to aggregate the prediction results.

This paper contributes towards EDM, and its major contributions are listed as follows:

  • Firstly, this work intends to develop the cluster based distributed architecture for predicting the student’s performance.

  • Secondly, the LW-DBN algorithm is developed combining Lion–Wolf algorithm with the DBN, for predicting the student’s performance from the available academic performance.

Further, the paper is organized as follows: Sect. 1 introduces the EDM model and the various techniques used for the prediction of the student's performance. Section 2 reviews the literary works dealing with EDM and big data. Section 3 presents the proposed cluster-based distributed architecture and the proposed LW-DBN algorithm. The simulation results achieved by the proposed LW-DBN are analyzed in Sect. 4, and the conclusion is presented in Sect. 5.

2 Motivation

2.1 Literature survey

This section presents the various literary works dealing with the clustering of the big data involved with the student’s performance.

Kotsiantis [5] presented a prediction model based on machine learning techniques, using the demographic characteristics of the students along with their marks for training. This prediction model did not consider the missing attributes in the database and hence could not counter the resulting bias. Another prediction model was proposed by Romero et al. [18] based on classification algorithms. The database for training was collected from quantitative, qualitative, and social network environments. The model improved the prediction, but the features it selected did not improve the accuracy of the classifier. Wolff et al. [19] proposed a prediction model with combined data sources and included various regression models in its design. The model defined the prediction capacity for predicting the student's performance. Guarín et al. [20] presented a prediction model based on the Naive Bayes and decision tree classifiers suitable for EDM. The model performed the prediction on a database containing both academic and non-academic information.

Chen et al. [21] presented a big data platform for clustering data from health records, using the Hadoop framework to implement the clustering model. The model improved scalability and flexibility; however, its implementation cost is high. Hussain et al. [4] proposed a prediction framework for analyzing the big data involved in EDM, and the framework tried to predict the student's performance through a verification strategy. Even though the model provided improved prediction results, it lacked integration of various data mining strategies. Bharara et al. [1] presented a model for learning the context in the database, based on the interrelationships between the attributes of the database. The K-means clustering used in the model clustered the database based on the feature set obtained from the learning context. Asif et al. [16] presented data mining methods to study the performance of undergraduate students, focusing on two aspects of students' performance. The first model produced a prediction based on the performance of the students at the end of a four-year study course. The other model concentrated on the typical progressions or study patterns of the students, and these patterns were then combined with the results of the prediction model. The work lacked generalization of the results.

Vojt [25] focused on three different kinds of deep neural networks: (i) the multilayer perceptron, (ii) the convolutional neural network, and (iii) the deep belief network. The comparison results showed the superior performance of multilayer perceptrons and convolutional neural networks, while the deep belief network implementation performed slightly better for restricted Boltzmann machine (RBM) layers with up to 1000 hidden neurons. Yazdani and Jolai [26] proposed a population-based algorithm named the Lion optimization algorithm (LOA). The results of the LOA were compared with some well-known meta-heuristics on benchmark problems, and the LOA outperformed the other algorithms.

Mirjalili et al. [27] proposed a meta-heuristic named the grey wolf optimizer (GWO). The GWO algorithm mimics the hunting mechanism and leadership hierarchy of grey wolves and includes three main steps, namely (i) searching for prey, (ii) encircling prey, and (iii) attacking prey. The results showed that the GWO algorithm provides competitive results compared with well-known meta-heuristics. Montana and Davis [28] described a set of experiments performed on data from a sonar image classification problem. These experiments demonstrated the improvements obtained by a genetic algorithm over backpropagation, as well as the evolution of the genetic algorithm's performance.

2.2 Challenges

The various challenges involved in clustering big data are described as follows:

  • Digitization of academic records allows universities to store data on websites or in electronic form. Universities also store the data of a large number of students, and hence the concept of big data needs to be incorporated into the data mining schemes while designing EDM [16].

  • The quality of the information available in the academic records mainly influences the prediction schemes. Also, the database may contain missing attributes that severely reduce the accuracy of the prediction [9].

  • In [20], both the academic and the non-academic information have been used for predicting the student's performance, even though the prediction requires only the academic records. Hence, feature selection for such a database requires close attention.

  • The use of machine learning schemes to develop prediction models has gained popularity nowadays, but the features used for training need to be carefully selected [10].

3 Proposed method

This section presents the proposed cluster-based distributed architecture for predicting the student's future performance. Figure 1 presents the diagram of the proposed cluster-based distributed architecture, and a brief sketch of the overall flow is given after the figure. The proposed prediction scheme contains two merger layers, namely merger layer 1 and merger layer 2, and two distributed layers, namely distributed layer 1 and distributed layer 2. The data for the prediction scheme is collected from the web pages of institutions; the database D is provided to the proposed architecture as different data sources. Each data source provided to the distributed layer 1 contains the information of the students belonging to an individual college. The distributed layer 1 imputes the missing data in the data sources and clusters each data source into smaller groups. Each merger present in the merger layer 1 combines the clusters and provides them to the distributed layer 2. In the distributed layer 2, the features of the data group from the merger are extracted and provided for training in the proposed LW-DBN network. The LW-DBN predicts the student's performance for the forthcoming semesters based on the features. Then, the predicted information is passed to the merger layer 2, which provides the information about the marks secured by the students in the upcoming semesters.

Fig. 1
figure 1

Block diagram of the proposed cluster-based distributed architecture for predicting the student’s performance
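The overall flow of the architecture can be outlined in the following minimal sketch. It is purely illustrative: the callables passed in (impute, cluster, select_features, predict) stand for the MDI, BFC, KPCA, and LW-DBN components detailed in the subsections below and are not implementations of them.

```python
def predict_performance(data_sources, impute, cluster, select_features, predict, M, X):
    """Sketch of the two distributed layers and two merger layers of Fig. 1.

    M is the number of clusters produced per data source and X the number of
    semesters to predict; predict() is assumed to return one value per semester.
    """
    # Distributed layer 1: impute missing data, then split each source into M clusters.
    clustered = [cluster(impute(D_i), M) for D_i in data_sources]

    # Merger layer 1: merger m gathers the m-th cluster group of every BFC node (Eq. 9).
    mergers = [[groups[m] for groups in clustered] for m in range(M)]

    # Distributed layer 2: per-merger KPCA feature selection followed by LW-DBN prediction.
    predictions = [predict(select_features(B_m)) for B_m in mergers]

    # Merger layer 2: merger o collects the o-th semester prediction of every node (Eq. 40).
    return [[Z_m[o] for Z_m in predictions] for o in range(X)]
```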

One of the major challenges in constructing the prediction model is the data collection, since the collected data must be genuine and free from errors. The proposed cluster-based architecture uses the academic records of the students obtained from the college websites and records. The database collected from each college corresponds to the student information of that individual college. Let us consider that there are \(\left| \hbox {S} \right| \) students available in each college, and the collected data sources are represented as,

$$\begin{aligned} \hbox {D}=\left\{ {\hbox {D}_1,\hbox {D}_2,\ldots ,\hbox {D}_{\mathrm{i}},\ldots ,\hbox {D}_{\mathrm{D}} } \right\} \end{aligned}$$
(1)

where \(\hbox {D}_{\mathrm{i}}\) refers to the data source collected from the \(\hbox {i}\)th college or institution, and each individual data source has the size \(\left| \hbox {S} \right| ^{*} \hbox {N}\). The term \(\left| \hbox {S} \right| \) indicates the student count in each data source and \(\hbox {N}\) refers to the total number of attributes available in the data source. The data source contains personal information, academic records, extracurricular activities, and schooling records. The proposed cluster-based distributed architecture has \(\hbox {D}\) data sources, \(\hbox {M}\) nodes in the merger layer 1, \(\hbox {M}\) nodes for the feature selection, and \(\hbox {X}\) nodes in the merger layer 2; hence, the total number of cluster nodes in the proposed architecture is expressed as,

$$\begin{aligned} A=D+M+M+X \end{aligned}$$
(2)

where, A refers to the total number of cluster nodes in the proposed architecture.
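For illustration, with hypothetical values of \(\hbox {D}=4\) data sources, \(\hbox {M}=5\) clusters per data source, and \(\hbox {X}=8\) semesters, Eq. (2) would give \(\hbox {A}=4+5+5+8=22\) cluster nodes in total.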

3.1 Construction of the distributed layer 1

The distributed layer 1 forms the primary layer of the proposed prediction scheme, and it has the missing data imputation (MDI) block and the clustering block. Since the \(\hbox {D}\) data sources contain a large amount of data, the proposed architecture processes the data sources simultaneously. The distributed layer 1 has \(\hbox {D}\) processors for the clustering process.

3.1.1 Missing data imputation

The data sources collected from the information sources contain missing data, and hence this factor needs to be handled before predicting the performance of the students. The MDI block present in the proposed cluster-based distributed framework depends on the clustering-based neural network defined in [17]. Here, the data present in the data sources are clustered with the WLI fuzzy clustering model, which finds the centroid by averaging the data points. The clustered information is provided to the hybrid neural network, where the GWO is involved in training the weights of the NN. The hybrid NN provides the information about the missing data available in the data source. Finally, the MDI block fills in the missing attribute values in the data source and sends it to the next layer.
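A minimal sketch of cluster-guided imputation is given below. It uses k-means and cluster-centroid filling purely as a simplified stand-in for the WLI fuzzy clustering and GWO-trained hybrid NN of [17]; the function name and parameters are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

def impute_by_cluster_centroid(D_i, n_clusters=5, random_state=0):
    """Fill each missing attribute with the centroid value of the record's cluster.

    Simplified stand-in for the MDI block; the actual block uses WLI fuzzy
    clustering and a GWO-trained hybrid NN [17]."""
    D_i = np.asarray(D_i, dtype=float)
    col_means = np.nanmean(D_i, axis=0)
    filled = np.where(np.isnan(D_i), col_means, D_i)   # provisional fill for clustering only
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit_predict(filled)
    imputed = D_i.copy()
    for c in range(n_clusters):
        members = labels == c
        centroid = filled[members].mean(axis=0)
        rows, cols = np.where(np.isnan(D_i) & members[:, None])
        imputed[rows, cols] = centroid[cols]
    return imputed
```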

3.1.2 Bayesian fuzzy clustering

The data source contains a large amount of information, which is clustered into smaller groups for processing. The proposed architecture utilizes the BFC [22] scheme for dividing the data source into smaller groups. The BFC model eliminates the manual choice of the number of clusters for the data clustering, and hence the model is more advantageous while clustering large data. The distributed layer 1 has \(\hbox {D}\) systems for clustering the data from each MDI block. Consider that the data source \(\hbox {D}_{\mathrm{i}}\) from the \(\hbox {i}\)th MDI block is provided to the \(\hbox {i}\)th BFC block for data clustering. The BFC utilizes the membership function \(\hbox {Q}\) to find the suitable cluster prototypes. The BFC uses the uniform symmetric Dirichlet proposal for the clustering, defined as,

$$\begin{aligned} \hbox {g}_{\mathrm{n}}^{+} \sim \hbox { Dirichlet}\left( {\eta =1_{\mathrm{M}} } \right) \end{aligned}$$
(3)

where \(\hbox {M}\) and \(\hbox {g}_{\mathrm{n}}^{+}\) refer to the number of clusters and the uniform symmetric Dirichlet proposal for the data clustering. The cluster prototypes are defined through the distribution, and it is expressed as,

$$\begin{aligned} \tilde{\mathrm{P}} (\hbox {d}_{\mathrm{n}} ,\hbox {g}_{\mathrm{n}} | \tilde{\mathrm{G}} )= & {} \hbox {P}(\hbox {d}_{\mathrm{n}} |{\hbox {g}_{\mathrm{n}} } ,\hbox {G})\tilde{\mathrm{P}} \left( \hbox {g}_{\mathrm{n}} | \hbox {G} \right) \end{aligned}$$
(4)
$$\begin{aligned} \tilde{\mathrm{P}} (\hbox {d}_{\mathrm{n}} ,\tilde{\mathrm{g}}_{\mathrm{n}} | {\mathrm{G}})= & {} \sum _{\mathrm{m}=1}^{\mathrm{M}} {\mathrm{N}\left( \hbox {d}_{\mathrm{n}} \big | {\hbox {h}_{\mathrm{n}} ,\hbox {g}_{\mathrm{nm}}^{-\mathrm{v}}} \right) } \hbox {g}_{\mathrm{nm}}^{-\mathrm{v}\beta /2} \hbox {Dirichlet}(\hbox {g}_{\mathrm{n}} \left| \eta \right. )\nonumber \\ \end{aligned}$$
(5)

The cluster prototypes are subjected to the variation due to the Gaussian distribution, and hence, the Markov chain state rule is applied. The modified value of the cluster prototype is given as,

$$\begin{aligned} \hbox {P}({\hbox {D}_{\mathrm{i}}, \hbox {h}_{\mathrm{m}} | \hbox {Q}})= \hbox {P}({\hbox {D}_{\mathrm{i}} | \hbox {Q},\hbox {h}_{\mathrm{m}} }) \hbox {P} \left( {\hbox {h}_{\mathrm{m}}} \right) \end{aligned}$$
(6)

where \(\hbox {Q}\) refers to the membership function used for the data clustering. Finally, the clustered information of the data source \(\hbox {D}_{\mathrm{i}}\) from the \(\hbox {i}\)th BFC is represented as,

$$\begin{aligned} \hbox {M}_{\mathrm{i}} =\left\{ {\hbox {M}_{1}^{\mathrm{i}}, \hbox {M}_{2}^{\mathrm{i}},\ldots ,\hbox {M}_{\mathrm{m}}^{\mathrm{i}},\ldots ,\hbox {M}_{\mathrm{M}}^{\mathrm{i}} } \right\} \end{aligned}$$
(7)

where \(\hbox {M}_{\mathrm{i}}\) represents the output of the ith BFC in the distributed layer 1, and \(\hbox {M}\) is the total number of clusters obtained from the ith BFC. Each clustered output from the BFC has the size of \(\left| {J} \right| ^{*} {N}\). Hence, the final output of the distributed layer 1 is the data clusters from each BFC, and it is represented as,

$$\begin{aligned} \hbox {M}=\left\{ {\hbox {M}_1,\hbox {M}_2,\ldots ,\hbox {M}_{\mathrm{i}},\ldots ,\hbox {M}_{\mathrm{D}} } \right\} \end{aligned}$$
(8)

where D is the total number of nodes required for BFC, which is equivalent to the number of data sources.
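As a small illustration of the proposal distribution in Eq. (3), the soft membership vector of each data point can be drawn from a uniform symmetric Dirichlet. The sketch below shows only this sampling step, not the full BFC sampler of [22].

```python
import numpy as np

rng = np.random.default_rng(0)

def dirichlet_membership_proposals(n_points, M):
    """Draw g_n ~ Dirichlet(eta = 1_M) for every data point (Eq. 3).

    Each row sums to one and serves as the proposed soft cluster membership
    of one data point over the M clusters."""
    return rng.dirichlet(np.ones(M), size=n_points)

# Example: membership proposals for 6 data points over M = 3 clusters.
G = dirichlet_membership_proposals(6, 3)
print(G.shape, G.sum(axis=1))   # (6, 3); each row sums to 1.0
```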

3.2 Construction of the merger layer 1

The merger layer 1 combines the results of the distributed layer 1. The merger layer 1 has \(\hbox {M}\) number of mergers, represented as \(\left\{ {\hbox {B}_1,\hbox {B}_2,\ldots ,\hbox {B}_{\mathrm{m}},\ldots ,\hbox {B}_{\mathrm{M}} } \right\} \), for collecting the \(\hbox {M}\) clusters from each BFC. The function of the mth merger in the merger layer 1 is to combine the mth cluster group produced by each BFC. The output of the mth merger is defined as,

$$\begin{aligned} \hbox {B}_{\mathrm{m}} =\left\{ {\hbox {M}_{\mathrm{m}}^{1},\hbox {M}_{\mathrm{m}}^{2},\ldots ,\hbox {M}_{\mathrm{m}}^{\mathrm{i}},\ldots ,\hbox {M}_{\mathrm{m}}^{\mathrm{D}} } \right\} \end{aligned}$$
(9)

where \(\hbox {M}_{\mathrm{m}}^{\mathrm{i}}\) indicates the mth cluster group from the ith BFC. The data size of each merger is defined as \(\left| {U} \right| ^{*} {N}\).

3.3 Construction of the distributed layer 2

The distributed layer 2 gets the merged data from the merger layer 1 for the feature selection, followed by the prediction. The distributed layer 2 is provided with M feature selector blocks and M predictor blocks. The proposed architecture employs the KPCA model for feature selection and the proposed LW-DBN algorithm for the prediction.

3.3.1 Feature selection: KPCA

It is necessary to select suitable features from the clustered information. The proposed architecture makes use of the existing KPCA [23] model to select the features from each merger. The distributed layer 2 has \(\hbox {M}\) KPCA feature selector blocks, defined as \(\left\{ {\hbox {F}_1 ,\hbox {F}_2 ,\ldots ,\hbox {F}_{\mathrm{m}} ,\ldots ,\hbox {F}_{\mathrm{M}} } \right\} \). Consider the data group available in the mth merger as \(\hbox {B}_{\mathrm{m}}\), which is of the size \(\left| \hbox {U} \right| ^{*} \hbox {N}\) and is used as the training data for the feature selection by the KPCA model. Consider the data present in the merger as \(\hbox {B}_{\mathrm{m}} =\left\{ {\hbox {B}_{\mathrm{uq}} } \right\} \), where \(\hbox {u}\) and \(\hbox {q}\) vary over the size of the data \(\left| \hbox {U} \right| ^{*} \hbox {N}\). The mean of the mapped training samples for the feature selection is defined as \(\frac{1}{\left| \hbox {U} \right| }\sum _{\mathrm{u}=1}^{\left| \hbox {U} \right| } {\gamma \left( {\hbox {b}_\mathrm{u} } \right) }\). The KPCA uses the kernel function for selecting the features from the training data, and it is expressed as,

$$\begin{aligned} \hbox {T}\left( {\hbox {b,b}_{\mathrm{u}} } \right) =\gamma ^{\mathrm{T}}(\hbox {b}_{\mathrm{u}} )\gamma (\hbox {b}) \end{aligned}$$
(10)

where the term \(\hbox {T}\) refers to the kernel function. The application of the kernel function to the data yields the covariance matrix defined as,

$$\begin{aligned} \hbox {B}=\frac{1}{\left| \hbox {U} \right| }\sum _{\mathrm{u}=1}^{\left| \hbox {U} \right| } {\gamma ^{\mathrm{T}}(\hbox {b}_{\mathrm{u}} )\gamma (\hbox {b})} \end{aligned}$$
(11)

The selection of the features from the data is considered as an eigenvalue problem, stated as follows,

$$\begin{aligned} B\cdot V=\lambda \cdot V \end{aligned}$$
(12)

where V indicates the associated eigenvectors and \(\lambda \) represents the eigenvalues of B. Rearranging the above equation, the required eigenvectors for the feature selection are obtained, and they are expressed as,

$$\begin{aligned} T\cdot \phi =u\cdot \lambda \cdot \phi \end{aligned}$$
(13)

where \(\phi \) indicates the associated eigenvector and u indicates the total number of eigenvalue coefficients. The features extracted from the training data \(\hbox {B}_{\mathrm{m}}\) are represented as,

$$\begin{aligned} \hbox {f}(\hbox {B}_{\mathrm{uq}} )=\sum _{\mathrm{u}=1}^{\left| \mathrm{U} \right| } {\phi _{\mathrm{uq}} \gamma ^{\mathrm{T}}(\hbox {b}_\mathrm{q} )\gamma (\hbox {b}_\mathrm{u})} \end{aligned}$$
(14)

The features from the \(\hbox {m}\)th KPCA are represented by the following equation,

$$\begin{aligned} \hbox {F}_{\mathrm{m}} =\left\{ {\hbox {f}_{\mathrm{m}}^{1} ,\hbox {f}_{\mathrm{m}}^{2} ,\ldots ,\hbox {f}_{\mathrm{m}}^{\mathrm{r}} ,\ldots ,\hbox {f}_{\mathrm{m}}^{\mathrm{R}} } \right\} \end{aligned}$$
(15)

where \(f_{m}^{r}\) represents the rth feature from the mth KPCA, and the feature set from each KPCA has the size \(\left| \hbox {U} \right| ^{*}\hbox {R}\), such that the value of R is less than the total number of attributes N.
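A minimal sketch of the KPCA step is given below. It assumes an RBF kernel (the paper does not fix the kernel) and returns the R leading kernel principal components of a merger's data, following Eqs. (10)-(14); an off-the-shelf implementation such as scikit-learn's KernelPCA could be used equivalently.

```python
import numpy as np

def kpca_features(B_m, R, gamma=1.0):
    """Project the |U| x N merger data B_m onto its R leading kernel principal
    components, returning a |U| x R feature matrix (Eq. 15)."""
    B_m = np.asarray(B_m, dtype=float)
    sq_dists = np.sum((B_m[:, None, :] - B_m[None, :, :]) ** 2, axis=-1)
    T = np.exp(-gamma * sq_dists)                      # kernel matrix, Eq. (10)

    # Centre the kernel matrix in feature space.
    U = T.shape[0]
    one = np.full((U, U), 1.0 / U)
    T_c = T - one @ T - T @ one + one @ T @ one

    # Solve the eigenvalue problem of Eq. (13); eigh returns ascending eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(T_c)
    top = np.argsort(eigvals)[::-1][:R]
    alphas = eigvecs[:, top] / np.sqrt(np.maximum(eigvals[top], 1e-12))

    # Project the training data onto the R leading components (Eq. 14).
    return T_c @ alphas
```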

3.3.2 Prediction of student’s performance using the proposed LW-DBN

The features selected by the KPCA are provided to the LW-DBN for training. The distributed layer 2 has \(\hbox {M}\) proposed LW-DBN networks for predicting the students' performance, and the features selected by each KPCA block are fed to the corresponding LW-DBN network. Figure 2 presents the architecture of the LW-DBN model containing two RBM layers and one multilayer perceptron (MLP) layer.

Fig. 2
figure 2

Architecture of the proposed LW-DBN model

The proposed LW-DBN network has two RBM layers and one MLP layer, and the features from the KPCA are provided to the first RBM layer for training. The number of neurons in the visible layer of the first RBM corresponds to the input feature size, and the hidden layer of RBM 1 carries the corresponding weights. The input provided to the LW-DBN is represented as,

$$\begin{aligned} \hbox {L}^{1}=\left\{ {\hbox {L}_{1}^{1} ,\hbox {L}_{2}^{1} ,\ldots ,\hbox {L}_{\mathrm{r}}^{1} ,\ldots ,\hbox {L}_{\mathrm{R}}^{1} } \right\} ;\quad 1\le \hbox {r}\le \hbox {R} \end{aligned}$$
(16)

where \(\hbox {L}_{\mathrm{r}}^{1}\) refers to the feature input to the \(\hbox {r}\)th visible neuron of RBM layer 1 and R is the total number of neurons in the input layer. The hidden neurons of the first RBM layer of the LW-DBN are represented as,

$$\begin{aligned} \hbox {H}^{1}=\left\{ {\hbox {H}_{1}^{1} ,\hbox {H}_{2}^{1} , \ldots ,\hbox {H}_{\mathrm{s}}^{1} ,\ldots ,\hbox {H}_{\mathrm{a}}^{1} } \right\} ;\quad 1\le \hbox {s}\le \hbox {a} \end{aligned}$$
(17)

where \(\hbox {H}_{\mathrm{s}}^{1} \) refers to the \(\hbox {s}\)th hidden neuron in the first RBM layer. The weights of the first RBM layer of the LW-DBN are represented as,

$$\begin{aligned} \hbox {W}^{1}=\left\{ {\hbox {W}_{\mathrm{rs}}^1 } \right\} \end{aligned}$$
(18)

where \(\hbox {W}_{\mathrm{rs}}^{1}\) corresponds to the weight between the \(\hbox {r}\)th visible neuron and the \(\hbox {s}\)th hidden neuron of RBM 1. The output of RBM layer 1 in the LW-DBN depends on the weights and the feature input, and it is expressed as,

$$\begin{aligned} \hbox {K}_{\mathrm{s}}^{1} =\tau \left[ {\hbox {y}_{\mathrm{s}}^{1} +\sum _{\mathrm{r}} {\hbox {L}_{\mathrm{r}}^{1} \hbox {W}_{\mathrm{rs}}^{1}}} \right] \end{aligned}$$
(19)

where \(\tau \) indicates the activation function and \(\hbox {y}_{\mathrm{s}}^{1}\) denotes the bias of the \(\hbox {s}\)th hidden neuron; the outputs are combined and expressed as,

$$\begin{aligned} \hbox {K}^{1}=\left\{ {\hbox {K}_{\mathrm{s}}^1 } \right\} ; \quad 1\le \hbox {s}\le \hbox {a} \end{aligned}$$
(20)

Similarly, the output of the RBM layer 1 is provided as the input to the RBM layer 2, and the output of the RBM layer two is expressed as \(\hbox {K}^{2}\). The output of the RBM layer two is directly fed to the MLP layer, and it is given as,

$$\begin{aligned} \hbox {I}=\left\{ {\hbox {I}_1 ,\hbox {I}_2 ,\ldots ,\hbox {I}_{\mathrm{s}} ,\ldots ,\hbox {I}_{\mathrm{a}} } \right\} =\left\{ {\hbox {K}_{\mathrm{s}}^2 } \right\} ;\quad 1\le \hbox {s}\le \hbox {a} \end{aligned}$$
(21)

where \(\hbox {I}_{\mathrm{s}}\) represents the \(\hbox {s}\)th input neuron of the MLP layer, and the hidden layer of the MLP has X neurons. The weights corresponding to the input and the hidden layers of the MLP are expressed below,

$$\begin{aligned} \hbox {w}^{\mathrm{I}\_\mathrm{MLP}}=\left\{ {\hbox {w}_{\mathrm{sx}}^{\mathrm{I}\_\mathrm{MLP}} } \right\} ;\quad 1\le \hbox {s}\le \hbox {a};\quad 1\le \hbox {x}\le \hbox {X} \end{aligned}$$
(22)
$$\begin{aligned} \hbox {w}^{\mathrm{H}\_\mathrm{MLP}}=\left\{ {\hbox {w}_{\mathrm{xc}}^{\mathrm{H}\_\mathrm{MLP}} } \right\} ;\quad 1\le \hbox {x}\le \hbox {X};\quad 1\le \hbox {c}\le \psi \end{aligned}$$
(23)

where \(\hbox {w}^{\mathrm{I}\_\mathrm{MLP}}\) indicates the weights between the input and hidden layers of the MLP, \(\hbox {w}^{\mathrm{H}\_\mathrm{MLP}}\) indicates the weights between the hidden and output layers of the MLP, and \(\psi \) is the number of output neurons. The output of the hidden layer of the MLP is given by the following equation,

$$\begin{aligned} \hbox {e}_{\mathrm{x}} =\left[ {\sum _{\mathrm{s}=1}^{\mathrm{a}} {\hbox {w}_{\mathrm{sx}}^{\hbox {I}\_\mathrm{MLP}} {^{*}}\hbox {I}_{\mathrm{s}} } } \right] \hbox {E}_{\mathrm{X}} \forall \hbox {I}_{\mathrm{s}} =\hbox {K}_{\mathrm{s}}^{2} \end{aligned}$$
(24)

The final output of the proposed LW-DBN is expressed as,

$$\begin{aligned} Z=\sum _{x=1}^X {w_{xc}^{H-MLP} {^{*}}e_x } \end{aligned}$$
(25)

where, \(e_{x}\) is the output of the hidden layer.
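The forward computation of Eqs. (16)-(25) can be sketched as follows. The sigmoid activations and the linear output layer are assumptions of this sketch, since the paper only denotes the activation by \(\tau \); weight and bias shapes are also assumed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lw_dbn_forward(L1, rbm1, rbm2, W_in, W_hid):
    """Forward pass of the network in Fig. 2: two stacked RBM layers followed by a
    one-hidden-layer MLP.  rbm1 and rbm2 are (weights, hidden-bias) pairs; W_in and
    W_hid are the MLP weights of Eqs. (22)-(23)."""
    K1 = sigmoid(L1 @ rbm1[0] + rbm1[1])   # first RBM hidden output, Eqs. (19)-(20)
    K2 = sigmoid(K1 @ rbm2[0] + rbm2[1])   # second RBM hidden output, fed to the MLP (Eq. 21)
    e = sigmoid(K2 @ W_in)                 # MLP hidden-layer output, Eq. (24)
    Z = e @ W_hid                          # predicted marks, Eq. (25)
    return Z
```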

A. Training phase of the DBN with the existing LW algorithm

The optimal weights for the RBM and the MLP layers are found with the use of the existing LW algorithm defined in [24]. The LW algorithm is an integration of the LOA and GWO algorithms. In the existing work [24], the LW algorithm was used for training an NN; for better learning of the deep features in the database, this work replaces the NN with the DBN. In the training phase of the DBN [25], the optimal weights of the MLP layers are found with the LW algorithm along with the gradient descent algorithm, while the weights involved in the RBM layers are trained with backpropagation. The training procedure for finding the optimal weights in the MLP layers is defined as follows,

  1. (1)

    The weights present in the input and the hidden layers of the MLP are initialized based on Eqs. (22) and (23), respectively.

  2. (2)

    Provide the training input to the MLP layer obtained from the second RBM layer.

  3. (3)

    Find the MLP layer output Z and the hidden-layer output \(\hbox {e}_{\mathrm{x}}\) based on Eqs. (25) and (24), respectively.

  4. (4)

    The computed output of the MLP layer has some deviation from the ground value. Thus, the error value is expressed as follows,

    $$\begin{aligned} C_{avg} =\frac{1}{X}\sum _{c=1}^X {\left( {Z_c -\psi _c } \right) ^{2}} \end{aligned}$$
    (26)

    where \(\hbox {C}_{\mathrm{avg}} \) indicates the deviation of the response of the classifier from the actual response, \(Z_c\) indicates the \(\hbox {c}\)th output of the LW-DBN, and the term \(\psi _{\mathrm{c}}\) indicates the desired response.

  5. (5)

    The error computed from the previous step contributes to the weight adjustment in the input and the hidden layer of the LW-DBN, and it is expressed in the following equations,

    $$\begin{aligned} \Delta \hbox {w}_{\mathrm{sx}}^{{\hbox {I-MLP}}}= & {} -\,\chi \frac{\partial \hbox {C}_{\mathrm{avg}} }{\partial \hbox {w}_{\mathrm{sx}}^{{\hbox {I-MLP}}} } \end{aligned}$$
    (27)
    $$\begin{aligned} \Delta \hbox {w}_{\mathrm{x}}^{{\hbox {H-MLP}}}= & {} -\,\chi \frac{\partial \hbox {C}_{\mathrm{avg}}}{\partial \hbox {w}_{\mathrm{x}}^{{\hbox {H-MLP}}} } \end{aligned}$$
    (28)
  6. (6)

    Update the weight of the input and the hidden layer of MLP based on the gradient descent algorithm, and it is defined as,

    $$\begin{aligned} \hbox {w}_{\mathrm{sx}(\mathrm{Gr})}^{{\hbox {I-MLP}}} \left( {\hbox {t}+1} \right)= & {} \hbox {w}_{\mathrm{sx}}^{{\hbox {I-MLP}}} \left( \hbox {t} \right) +\Delta \hbox {w}_{\mathrm{sx}}^{{\hbox {I-MLP}}} \end{aligned}$$
    (29)
    $$\begin{aligned} \hbox {w}_{\mathrm{x}(\mathrm{Gr})}^{{\hbox {H-MLP}}} \left( {\hbox {t}+1} \right)= & {} \hbox {w}_{\mathrm{x}}^{{\hbox {H-MLP}}} \left( \hbox {t} \right) +\Delta \hbox {w}_{\mathrm{x}}^{{\hbox {H-MLP}}} \end{aligned}$$
    (30)
  7. (7)

    The weight update based on the LW algorithm for both the input and the hidden layer of the MLP is defined as follows,

    $$\begin{aligned}&\hbox {w}_{\mathrm{sx}(\mathrm{LM})}^{{\hbox {I-MLP}}} \left( {\hbox {t}+1} \right) \nonumber \\&\quad =\frac{\mathop {\hbox {w}_{\mathrm{sx(LM)}}^{{\hbox {I-MLP}}} }\limits ^\rightarrow (1)+\mathop {\hbox {w}_{\mathrm{sx(LM)}}^{{\hbox {I-MLP}}} }\limits ^\rightarrow (2)+\mathop {\hbox {w}_{\mathrm{sx(LM)}}^{{\hbox {I-MLP}}} }\limits ^\rightarrow (3)+\mathop {\hbox {w}_{\mathrm{sx(LM)}}^{{\hbox {I-MLP}}} }\limits ^\rightarrow (4)}{4}\nonumber \\ \end{aligned}$$
    (31)

    where the terms \(\mathop {\hbox {w}_{\mathrm{sx(LM)}}^{{\hbox {I-MLP}}} }\limits ^\rightarrow (1)\), \(\mathop {\hbox {w}_{\mathrm{sx(LM)}}^{{\hbox {I-MLP}}} }\limits ^\rightarrow (2)\), and \(\mathop {\hbox {w}_{\mathrm{sx(LM)}}^{{\hbox {I-MLP}}} }\limits ^\rightarrow (3)\) refer to the positions of the first, second, and third best search agents for the input layer, while \(\mathop {\hbox {w}_{\mathrm{sx(LM)}}^{{\hbox {I-MLP}}} }\limits ^\rightarrow (4)\) is the candidate obtained from the LOA. The values of \(\mathop {\hbox {w}_{\mathrm{sx(LM)}}^{{\hbox {I-MLP}}} }\limits ^\rightarrow (1)\), \(\mathop {\hbox {w}_{\mathrm{sx(LM)}}^{{\hbox {I-MLP}}} }\limits ^\rightarrow (2)\), and \(\mathop {\hbox {w}_{\mathrm{sx(LM)}}^{{\hbox {I-MLP}}} }\limits ^\rightarrow (3)\) are obtained from the position update done based on the GWO algorithm, and their expressions are defined as follows,

    $$\begin{aligned} \mathop {\hbox {w}_{\mathrm{sx(LM)}}^{\mathrm{I}-\mathrm{MLP}} }\limits ^\rightarrow (1)= & {} \mathop {\hbox {w}_{\mathrm{sx(LM)}}^{{\hbox {I-MLP}}} }\limits ^\rightarrow (\omega _1 )-\mathop {\hbox {z}_1 }\limits ^\rightarrow \cdot \mathop {\hbox {l}(\omega _1 )}\limits ^\rightarrow \end{aligned}$$
    (32)
    $$\begin{aligned} \mathop {\hbox {w}_{\mathrm{sx(LM)}}^{{\hbox {I-MLP}}} }\limits ^\rightarrow (2)= & {} \mathop {\hbox {w}_{\mathrm{sx(LM)}}^{{\hbox {I-MLP}}} }\limits ^\rightarrow (\omega _2 )-\mathop {\hbox {z}_2 }\limits ^\rightarrow \cdot \mathop {\hbox {l}(\omega _2 )}\limits ^\rightarrow \end{aligned}$$
    (33)
    $$\begin{aligned} \mathop {\hbox {w}_{\mathrm{sx(LM)}}^{{\hbox {I-MLP}}} }\limits ^\rightarrow (3)= & {} \mathop {\hbox {w}_{\mathrm{sx(LM)}}^{{\hbox {I-MLP}}} }\limits ^\rightarrow (\omega _3 )-\mathop {\hbox {z}_3 }\limits ^\rightarrow \cdot \mathop {\hbox {l}(\omega _3 )}\limits ^\rightarrow \end{aligned}$$
    (34)

    where \(\mathop {\hbox {w}_{\mathrm{sx(LM)}}^{{\hbox {I-MLP}}} }\limits ^\rightarrow (\omega _1)\), \(\mathop {\hbox {w}_{\mathrm{sx(LM)}}^{{\hbox {I-MLP}}} }\limits ^\rightarrow (\omega _2)\), and \(\mathop {\hbox {w}_{\mathrm{sx(LM)}}^{{\hbox {I-MLP}}} }\limits ^\rightarrow (\omega _3)\) indicate the search positions in the directions \(\omega _1 \), \(\omega _2 \), and \(\omega _3 \) of the 2D space, respectively. The value of \(\mathop {\hbox {w}_{\mathrm{sx(LM)}}^{{\hbox {I-MLP}}} }\limits ^\rightarrow (4)\) indicates the solution update obtained from the fertility evaluation of the female lion used in the LOA, and it is defined as follows,

    $$\begin{aligned} \mathop {\hbox {w}_{\mathrm{sx(LM)}}^{{\hbox {I-MLP}}} }\limits ^\rightarrow (4)=\min \left[ {\mathop {\hbox {w}_{\mathrm{sx(LM)}}^{{\hbox {I-MLP}}} }\limits ^\rightarrow (\max ),\max \left[ {\mathop {\hbox {w}_{\mathrm{sx(LM)}}^{{\hbox {I-MLP}}} }\limits ^\rightarrow (\min ),\nabla \hbox {c}} \right] } \right] \nonumber \\ \end{aligned}$$
    (35)

    where, the value of the \(\nabla \hbox {c}\) indicates the female update function used in the LOA. Similarly, the weight update for the hidden layer of the MLP is expressed as follows,

    $$\begin{aligned}&\hbox {w}_{{\mathrm{sx(LM)}}}^{{{\hbox {H-MLP}}}} \left( {\hbox {t}+1} \right) \nonumber \\&\quad =\frac{\mathop {\hbox {w}_{{\mathrm{sx(LM)}}}^{{{\hbox {H-MLP}}}} }\limits ^\rightarrow (1)+\mathop {\hbox {w}_{{\mathrm{sx(LM)}}}^{{{\hbox {H-MLP}}}} }\limits ^\rightarrow (2)+\mathop {\hbox {w}_{{\mathrm{sx(LM)}}}^{{{\hbox {H-MLP}}}} }\limits ^\rightarrow (3)+\mathop {\hbox {w}_{{\mathrm{sx(LM)}}}^{{{\hbox {H-MLP}}}} }\limits ^\rightarrow (4)}{4}\nonumber \\ \end{aligned}$$
    (36)
  8. (8)

    Calculate the output of the algorithm based on the weight update done through the gradient descent algorithm and the calculated error is expressed as \(\hbox {C}_{\mathrm{avg(Gr)}} \)

  9. (9)

    Similarly, compute the output of the algorithm with the weights updated based on the LW algorithm and the defined error is expressed as \(\hbox {C}_{\mathrm{avg(LW)}} \).

  10. (10)

    Finally, the weights of the input and the hidden layers are selected based on Eqs. (37) and (38); the weight update with the minimal error value replaces the actual weight of the MLP layers (a simplified sketch of this selection rule is given after the procedure).

    $$\begin{aligned}&\hbox {w}_{\mathrm{sx}}^{{\hbox {I-MLP}}} \left( {\hbox {t}+1} \right) \nonumber \\&\quad =\left\{ {\begin{array}{ll} \hbox {w}_{\mathrm{sx(LW)}}^{{\hbox {I-MLP}}} \left( {\hbox {t}+1} \right) ; &{}\quad \hbox {if C}_{\mathrm{avg(LW)}} <\hbox {C}_{\mathrm{avg(Gr)}} \\ \hbox {w}_{\mathrm{sx(Gr)}}^{{\hbox {I-MLP}}} \left( {\hbox {t}+1} \right) ; &{}\quad \hbox {otherwise} \\ \end{array}} \right. \end{aligned}$$
    (37)
    $$\begin{aligned}&\hbox {w}_{\mathrm{sx}}^{{\hbox {H-MLP}}} \left( {\hbox {t}+1} \right) \nonumber \\&\quad =\left\{ {\begin{array}{ll} \hbox {w}_{\mathrm{sx(LW)}}^{{\hbox {H-MLP}}} \left( {\hbox {t}+1} \right) ; &{}\quad \hbox {if C}_{\mathrm{avg(LW)}} <\hbox {C}_{\mathrm{avg(Gr)}} \\ \hbox {w}_{\mathrm{sx(Gr)}}^{{\hbox {H-MLP}}} \left( {\hbox {t}+1} \right) ; &{}\quad \hbox {otherwise} \\ \end{array}} \right. \end{aligned}$$
    (38)
  11. (11)

    At the end of the iteration, the optimal weights of the MLP layer are returned by the proposed LW-DBN algorithm.
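The selection rule of steps (6)-(10) can be sketched as follows. The gradient and the four LW candidate positions are assumed to be supplied by the surrounding training loop and the LW optimizer, so the sketch covers only the choice between the gradient-descent and LW updates of Eqs. (37)-(38); the learning-rate name corresponds to \(\chi \).

```python
def hybrid_weight_step(w, grad, error_fn, lw_candidates, learning_rate=0.01):
    """One hybrid update of an MLP weight matrix, following steps (6)-(10).

    grad is dC_avg/dw; lw_candidates holds the four LW search-agent positions of
    Eqs. (31)-(35); error_fn maps a candidate weight matrix to its C_avg."""
    w_gd = w - learning_rate * grad                  # gradient-descent update, Eqs. (27)-(30)
    w_lw = sum(lw_candidates) / len(lw_candidates)   # LW update: mean of the agents, Eqs. (31)/(36)
    # Keep whichever candidate yields the smaller training error, Eqs. (37)-(38).
    return w_lw if error_fn(w_lw) < error_fn(w_gd) else w_gd
```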

B. Testing of the proposed LW-DBN classifier

The optimal weights from the LW-DBN training are used in the testing phase. Given the features of the test input \(\hbox {D}^{\mathrm{test}}\), the LW-DBN classifier predicts the marks of the students in the next semester.

3.4 Construction of the merger layer 2

The merger layer 2 in the proposed architecture has \(\hbox {X}\) mergers, which provide the predicted performance of the students in the various semesters. The output \(\hbox {O}\) from each merger present in merger layer 2 has the size \(\hbox {Y}^{*}1\). The term Y represents the group of students present in the database; thus, each merger in merger layer 2 provides the marks of \(\hbox {Y}\) students for one semester, and the \(\hbox {X}\) mergers of merger layer 2 together provide the marks of the \(\hbox {Y}\) students for the \(\hbox {X}\) semesters. The following equation represents the output of the \(\hbox {X}\) mergers present in the merger layer 2.

$$\begin{aligned} \hbox {O}=\left\{ {\hbox {O}_1 ,\hbox {O}_2 ,\ldots ,\hbox {O}_{\mathrm{o}} ,\ldots ,\hbox {O}_{\mathrm{X}} } \right\} \end{aligned}$$
(39)

where \(\hbox {O}_{\mathrm{o}}\) refers to the output of the \(\hbox {o}\)th merger present in the merger layer 2. The \(\hbox {o}\)th merger of merger layer 2 combines the \(\hbox {c}\)th output of the \(\hbox {m}\)th LW-DBN network in the distributed layer 2, for every \(\hbox {m}\). Thus, the first merger of merger layer 2 contains the semester-one marks of the \(\hbox {Y}\) students present in the database D. The output of the \(\hbox {o}\)th merger of merger layer 2 is defined as follows,

$$\begin{aligned} O_o =\left\{ {Z_c (1),Z_c (2),\ldots ,Z_c (m),\ldots ,Z_c (M)} \right\} \end{aligned}$$
(40)

where, the term \(\hbox {Z}_{\mathrm{c}} (\hbox {m})\) refers to the \(\hbox {c}\)th output layer in the \(\hbox {m}\)th LW-DBN network, and its size is \(\hbox {Y}^{*}1\). Finally, each merger provides the individual semester marks of the \(\hbox {Y}\) students in the \((\hbox {t}+1)\)th semester.

Algorithm 1 presents the pseudocode of the proposed cluster based distributed architecture for predicting the student’s performance.

figure f

4 Results and discussion

This section presents simulation results achieved by the proposed LW-DBN classifier for predicting the student’s performance.

4.1 Experimental setup

The proposed work is implemented in MATLAB, and the experiments are run on a PC with the Windows 10 OS, an Intel i3 processor, and 4 GB RAM.

4.1.1 Database description

The data for the experimentation is collected in real time, and it contains information about the student's personal details, environment, performance in school, and the academic record in each semester during college.

4.1.2 Evaluation metrics

The proposed system is evaluated based on the MSE and RMSE metrics, which define the error performance. The mathematical expressions for the MSE and the RMSE are as follows,

MSE The MSE metric defines the quality of the prediction system by calculating the deviation of the predicted value of the model from the actual value provided by the database, and it is expressed as follows,

$$\begin{aligned} \hbox {MSE}=\frac{1}{\mathrm{X}}\sum _{\mathrm{o}=1}^{\mathrm{X}} {\left( {\hbox {O}_{\mathrm{o}} -\mathop {\hbox {O}_{\mathrm{o}} }\limits ^\wedge } \right) } ^{2} \end{aligned}$$
(41)

where, \(\hbox {O}_{\mathrm{o}}\) indicates the predicted value and \(\mathop {\hbox {O}_{\mathrm{o}} }\limits ^\wedge \) refers to the ground truth information.

RMSE The RMSE also defines the error performance of the models and is expressed as the root value of the MSE metric.

$$\begin{aligned} \hbox {RMSE}=\sqrt{\mathrm{MSE}} \end{aligned}$$
(42)

Since both the evaluation metrics RMSE and MSE define the error performance of the model, lower values of RMSE and MSE indicate improved performance.
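A small example of computing both metrics (with hypothetical predicted and actual marks) is given below.

```python
import numpy as np

def mse_rmse(predicted, actual):
    """MSE and RMSE of the predicted semester marks against the ground truth
    (Eqs. 41-42)."""
    predicted, actual = np.asarray(predicted, float), np.asarray(actual, float)
    mse = np.mean((predicted - actual) ** 2)
    return mse, np.sqrt(mse)

# Hypothetical example: predicted vs. actual marks of four students.
print(mse_rmse([72.0, 65.5, 80.0, 58.0], [70.0, 66.0, 78.5, 60.0]))
```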

4.1.3 Comparative models

The proposed cluster-based distributed architecture along with the LW-DBN is compared with various existing techniques, such as the DBN, Lion-based DBN (L-DBN), Grey wolf-based DBN (W-DBN), and Genetic Algorithm-based DBN (GA-DBN). The existing models used for the analysis of the proposed LW-DBN are described as follows,

DBN [25] The prediction model with the DBN network uses the backpropagation algorithm for training the weights and bias present in the RBM layer and the MLP layer.

L-DBN In the L-DBN, the weights present in the MLP layer of the DBN are trained with the Lion optimization algorithm [26].

W-DBN Similar to the L-DBN algorithm, the DBN is trained with the grey wolf optimization [27] algorithm for predicting the student's performance.

GA-DBN In the existing GA-DBN algorithm, the weights involved in the MLP layer of the DBN are trained with the genetic algorithm (GA) [28].

4.2 Comparative analysis of the proposed LW-DBN

The performance of the proposed LW-DBN is compared with the DBN, L-DBN, W-DBN, and GA-DBN by varying the training percentage of the database (k), crossfold validation (cf), iteration (T), step ratio (sr), and dropout ratio (dr). The results of the proposed LW-DBN are analyzed based on the MSE and RMSE metrics.

4.2.1 Comparative analysis of models without the crossfold validation

A. Analysis based on varying the step ratio (sr)

Here, the comparative analysis is done by varying the step ratio of the DBN network and training the database without the crossfold validation. Figure 3a depicts the comparative analysis of the models based on MSE metric for varying \(\hbox {sr}\) values. For \(\hbox {sr}= 0.3\), the existing models DBN, L-DBN, W-DBN, and the GA-DBN achieved the MSE values of 0.337239, 0.238128, 0.559461, 0.536805, respectively, while the proposed LW-DBN algorithm achieved the lowest MSE value of 0.234044. Similarly, for the analysis based on RMSE as shown in Fig. 3b, the existing DBN, L-DBN, W-DBN, and the GA-DBN has the RMSE values of 0.116509, 0.056862, 0.314537, and 0.28957, respectively for \(\hbox {sr}= 0.3\). But, the proposed LW-DBN model has achieved the lower RMSE value of 0.054907 for the \(\hbox {sr}\) value of 0.3.

Fig. 3
figure 3

Comparative analysis of models without the crossfold validation based on a MSE by varying sr, b RMSE by varying sr, c MSE by varying dr, d RMSE by varying dr

B. Analysis based on varying the dropout ratio (dr)

Figure 3c, d presents the comparative analysis of the models based on varying values of dr for the database trained without crossfold validation. Analysis based on the MSE, as shown in Fig. 3c, shows that the existing DBN, L-DBN, W-DBN, and GA-DBN models achieved MSE values of 0.374352, 0.228539, 0.251737, and 0.539084, respectively, while the proposed LW-DBN model has the lower MSE value of 0.222606 for the \(\hbox {dr}\) value of 0.3. Likewise, the analysis based on the RMSE depicted in Fig. 3d shows that the existing DBN, L-DBN, W-DBN, and GA-DBN models have RMSE values of 0.132749, 0.053137, 0.064499, and 0.292894, respectively, for the \(\hbox {dr}\) value of 0.3. The proposed LW-DBN algorithm achieved the RMSE value of 0.050436 for the \(\hbox {dr}\) value of 0.3, thus outperforming the other models.

C. Analysis based on varying the iteration (T)

Figure 4a, b presents the comparative analysis of the models based on varying values of the iteration (T) without crossfold validation. Analysis based on the MSE, shown in Fig. 4a, indicates that the existing DBN, L-DBN, W-DBN, and GA-DBN models achieved MSE values of 0.53122, 0.237958, 0.563879, and 0.563879, respectively, while the proposed LW-DBN model has the lower MSE value of 0.233653 for \(\hbox {T}=1100\). Similarly, the analysis based on the RMSE depicted in Fig. 4b shows that the existing DBN, L-DBN, W-DBN, and GA-DBN models have RMSE values of 0.282533, 0.057077, 0.318471, and 0.318471, respectively, for \(\hbox {T}=1100\). The proposed LW-DBN algorithm achieved the RMSE value of 0.054969 for \(\hbox {T}=1100\) and thus performs better than the other models.

Fig. 4
figure 4

Comparative analysis of models without the crossfold validation based on a MSE by varying iteration (T), b RMSE by varying iteration (T), c MSE by varying training % (k), d RMSE by varying training % (k)

D. Analysis based on varying the training % (k)

Analysis based on the MSE, as presented in Fig. 4c, shows that the existing DBN, L-DBN, W-DBN, and GA-DBN models achieved MSE values of 0.54612, 0.289413, 0.547157, and 0.547157, respectively, while the proposed LW-DBN model has the lower MSE value of 0.240637 for \(\hbox {k}=75\). Similarly, the analysis based on the RMSE depicted in Fig. 4d shows that the existing DBN, L-DBN, W-DBN, and GA-DBN models have RMSE values of 0.274548, 0.055957, 0.058315, and 0.062019, respectively, for \(\hbox {k}=75\). The proposed LW-DBN algorithm achieved the RMSE value of 0.052342 for \(\hbox {k}=75\), thus showing improved performance over the other models.

4.2.2 Comparative analysis of models with the crossfold validation

A. Analysis based on varying the step ratio (sr)

Here, the comparative analysis is done by varying the step ratio of the DBN network and training the database with crossfold validation. Figure 5a depicts the comparative analysis of the models based on the MSE metric for varying \(\hbox {sr}\) values. For \(\hbox {sr}= 0.3\) and \(\hbox {cf}=6\), the existing models DBN, L-DBN, W-DBN, and GA-DBN achieved MSE values of 0.337239, 0.238128, 0.559461, and 0.536805, respectively, while the proposed LW-DBN algorithm achieved the lowest MSE value of 0.234044. Similarly, for the analysis based on the RMSE shown in Fig. 5b, the proposed LW-DBN model achieved a lower RMSE value of 0.054907 than the existing methods for the \(\hbox {sr}\) value of 0.3.

Fig. 5
figure 5

Comparative analysis of models with the crossfold validation based on a MSE by varying sr, b RMSE by varying sr, c MSE by varying dr, d RMSE by varying dr

Fig. 6
figure 6

Comparative analysis of models with the crossfold validation based on a MSE by varying T, b RMSE by varying T, c MSE by varying k, d RMSE by varying k

B. Analysis based on varying the dropout ratio (dr)

The analysis based on the MSE, as shown in Fig. 5c, depicts that the existing DBN, L-DBN, W-DBN, and GA-DBN models achieved MSE values of 0.502074, 0.240744, 0.401396, and 0.237185, respectively, while the proposed LW-DBN model has the lower MSE value of 0.236999 for \(\hbox {dr}= 0.3\) and \(\hbox {cf}=6\). Likewise, the analysis based on the RMSE depicted in Fig. 5d shows that the existing DBN, L-DBN, W-DBN, and GA-DBN models have RMSE values of 0.254469, 0.058816, 0.185429, and 0.056942, respectively, for \(\hbox {dr}= 0.3\) and \(\hbox {cf}=6\). The proposed LW-DBN algorithm achieved the RMSE value of 0.056731 for \(\hbox {dr}= 0.3\) and \(\hbox {cf}=6\), thus outperforming the other models.

C. Analysis based on varying the iteration (T)

Figure 6a, b presents the comparative analysis of the models based on the varying values of iteration (T) with crossfold validation. Analysis based on the MSE, as shown in Fig. 6a, depicts that the existing DBN, L-DBN, W-DBN, and the GA-DBN metrics achieved the MSE values of 0.404957, 0.233036, 0.561111, and 0.382309, respectively, while the proposed LW-DBN model has the lower MSE value of 0.229603 for \(\hbox {T}=1100\) and \(\hbox {cf}=6\). Similarly, analysis based on the RMSE, depicted in Fig. 6b, shows that the existing DBN, L-DBN, W-DBN, and the GA-DBN model has the RMSE value of 0.172144, 0.054514, 0.316843, and 0.171357 respectively for the \(\hbox {T}=1100\). The proposed LW-DBN algorithm achieved the RMSE value of 0.053109 for the \(\hbox {T}=1100\) and \(\hbox {cf}=6\).

D. Analysis based on varying the training % (k)

The comparative analysis based on the MSE, as given in Fig. 6c, shows that the existing DBN, L-DBN, W-DBN, and GA-DBN models achieved MSE values of 0.542483, 0.235306, 0.385438, and 0.400784, respectively, while the proposed LW-DBN model has the better MSE value of 0.234182 for \(\hbox {k}=75\) and \(\hbox {cf}=6\). Similarly, the analysis based on the RMSE depicted in Fig. 6d shows that the existing DBN, L-DBN, W-DBN, and GA-DBN models have RMSE values of 0.298485, 0.085124, 0.300974, and 0.300974, respectively, for \(\hbox {k}=75\) and \(\hbox {cf}=6\). The proposed LW-DBN algorithm achieved the RMSE value of 0.058772 for \(\hbox {k}=75\), thus showing improved performance over the other models.

4.3 Discussion

This section presents the comparative discussion of the models with and without crossfold validation, as shown in Table 1. The bold values in Table 1 indicate the best performance. As depicted in Table 1, when training the database without crossfold validation, the existing DBN, L-DBN, W-DBN, and GA-DBN achieved MSE values of 0.374351, 0.22853, 0.25173, and 0.53908, respectively. Training the database with various values of crossfold validation makes the existing DBN, L-DBN, W-DBN, and GA-DBN achieve MSE values of 0.404957, 0.23303, 0.56111, and 0.38230, respectively. Similarly, the DBN, L-DBN, W-DBN, and GA-DBN models attained RMSE values of 0.13274, 0.05313, 0.06449, and 0.29289, respectively, for the database without crossfold validation, and 0.175203, 0.05652, 0.18272, and 0.05442, respectively, with crossfold validation. The proposed LW-DBN model achieved the lowest MSE and RMSE values of 0.222606 and 0.050435 for the database analyzed without crossfold validation. Also, training the database with crossfold validation makes the proposed LW-DBN more suitable for the prediction, as it achieves minimum MSE and RMSE values of 0.229602 and 0.052406, respectively.

Table 1 Comparative discussion for LW-DBN

5 Conclusion

This paper primarily contributes towards EDM by introducing a cluster-based distributed architecture for predicting the student's performance in forthcoming academic terms. The proposed architecture predicts the student's performance from the academic performance of the students collected in previous records. The architecture has two merger layers and two distributed layers for performing various operations, such as clustering, feature extraction, and prediction. It uses the BFC for clustering the database and the KPCA model for feature selection from the resulting clusters. The features are provided to the proposed LW-DBN algorithm for training, and the LW algorithm chooses the optimal weights for the prediction. The experimentation of the proposed work is done by varying the iteration, training percentage, dropout ratio, and step ratio of the DBN network. The simulation results of the proposed LW-DBN are compared with various existing works and analyzed based on the MSE and RMSE metrics. The proposed LW-DBN model achieved lower error than the other models, with MSE and RMSE values of 0.222606 and 0.050435 on the database. A future enhancement of this work is to utilize advanced techniques for clustering and optimization to further improve the performance.