1 Introduction

In the field of data mining, classification algorithms are well suited to processing huge amounts of data. These algorithms are used to predict categorical classes for test data based on a training dataset and to classify new data instances. The field of classification covers many contexts in which decisions or forecasts are made on the basis of currently available information. Classification can be viewed as a procedure for repeatedly making decisions in novel situations. Here, the assumption is that a procedure is built and applied to a continuous sequence of instances, where every new instance is assigned to one of a set of pre-defined classes based on observed features of existing data instances. A classification procedure in which the exact class labels are known in the training phase is called supervised learning or pattern recognition [32]. Well-known AI techniques for solving classification problems are the support vector machine (SVM) [30], k-nearest neighbor (kNN) [56] and artificial neural network (ANN) [39]. Furthermore, a group of techniques known as deep learning (DL) algorithms has been introduced recently for solving these sophisticated problems.

Deep learning is a subdomain of machine learning that learns high-level abstractions present in the data through hierarchical learning. Deep learning algorithms are advanced versions of traditional neural networks, which suffer from some serious limitations. In traditional neural networks, adding more layers and units increases the expressive power of the network, which in turn makes the cost function more complex; learning becomes difficult and the network tends to overfit. Overfitting can be reduced by several methods, for example keeping the network size small, enlarging the training portion of the dataset and other techniques. Deep learning algorithms were introduced to overcome the limitations associated with traditional neural networks. This advanced approach has been utilized in a family of applications, for example transfer learning [26], semantic parsing [44], computer vision [34], natural language processing [53] and many other complex applications. The reasons behind the widespread use of deep learning algorithms are the low cost of computing hardware, powerful processing capabilities and the high level of advancement in machine learning techniques. Three DL algorithms are prominent in the most recent research, namely the convolutional neural network (CNN), the restricted Boltzmann machine (RBM) and the auto-encoder (AE). All of these deep learning algorithms have many variants suitable for different types of applications.

Auto-encoders (AEs) are a kind of artificial neural network consisting of three layers: an input layer, a hidden layer and an output layer, which are used for learning efficient encodings [50]. The AE reconstructs its own inputs instead of predicting outputs from the inputs, so the output vector has the same dimension as the input. During the reduction process in an auto-encoder, the objective is to minimize the reconstruction error, and the learned features are the code generated by the encoder [4]. A single layer cannot extract informative features from raw data; therefore, recent studies on AEs use multiple layers to extract the most useful features. The AE has many variants, e.g., the sparse auto-encoder [51], denoising auto-encoder [11], saturating auto-encoder [14], convolutional auto-encoder [20], zero-bias auto-encoder [21] and contractive auto-encoder [13].

The Boltzmann machine (BM) falls under the category of deep learning models based on probability distributions. The restricted Boltzmann machine (RBM) [16, 23] is one of the well-known variants of the standard BM and was first created by Geoff Hinton. The main purpose of the RBM is to reduce high-dimensional data into a low-dimensional feature space. Since it is a probability-based approach, it is stochastic and generative in nature [3]. The internal architecture of the RBM is similar to other neural networks (NNs), with neurons arranged in layers, except that the RBM has only two layers. The first layer is the input (visible) layer of the network, while the second layer is the hidden layer that produces the output from the input layer [6]. There are neural connections between the neurons of the input layer and those of the hidden layer. In a standard BM, connections also exist between neurons of the same layer, but in the RBM there is a restriction that neurons within the same layer cannot communicate with each other.

In this paper, we integrate RBM layers with the CAE in order to improve the dimensionality reduction capability of the CAE. This integration creates a learning approach with two major properties: robustness toward small noise or changes in the input feature vector, and the capability of learning high-order statistical features from a high-dimensional feature space. The proposed CAE–RBM thus combines the capabilities of both the CAE and the RBM. This paper has three main research contributions:

  1. It gives a brief introduction to, literature study of and applications of the CAE and RBM for solving multiclass classification problems in different domains.

  2. It proposes the CAE–RBM approach, which improves the dimensionality reduction ability of the conventional CAE and enhances the learning of high-order statistical features by integrating RBM layers during the feature reduction phase.

  3. It conducts extensive experiments on the proposed CAE–RBM model, together with state-of-the-art multiclass classification algorithms, on benchmark handwritten digit image datasets.

The rest of the paper is organized as follows: Sect. 2 shows related work; Sect. 3 presents the conventional contractive auto-encoder; Sect. 4 shows the proposed CAE–RBM approach; Sect. 5 shows experimental results; Sect. 6 shows the comparative analysis of results; and Sect. 7 presents the conclusion and future work.

2 Related Work

In the field of multiclass classification, supervised learning algorithms aim to assign a designated class to every input instance [43]. For an input dataset of pairs \((a_j, b_j)\), where \(a_j \in \mathbb {R}^n\) is the \(j\)th instance and \(b_j \in \{1, 2, 3, \ldots , k\}\) is its class label, the purpose is to find a model H such that \(H(a_j) = b_j\) for the testing instances. Many models and algorithms have been proposed in the last few decades for solving binary classification problems; most of them are easy to extend to the multiclass setting, while some use special formulations to solve the multiclass classification problem directly [8, 31]. Solving the multiclass classification problem is challenging [24]. In state-of-the-art classification approaches, the multiclass problem is often solved by splitting it into many independent binary classification subproblems, as illustrated in the sketch below. Image classification is a well-known application of classification in which images are categorized on the basis of their contents and the objects they contain [12]. It is one of the challenging problems in the field of classification and decision making [1, 41].
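To make the binary decomposition concrete, the following sketch illustrates the one-vs-rest strategy on a small digit dataset; the choice of scikit-learn's LogisticRegression as the base binary learner and the dataset used are illustrative assumptions, not details from the cited works.

```python
# Hypothetical illustration: split a k-class problem into k independent
# one-vs-rest binary subproblems and predict with the most confident one.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = load_digits(return_X_y=True)            # 10-class digit data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# One binary classifier per class; prediction picks the class whose binary
# classifier is most confident, i.e., H(a_j) = b_j on the test instances.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_tr, y_tr)
print("test accuracy:", ovr.score(X_te, y_te))
```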

In recent literature, generative adversarial networks (GANs) have been widely used in well-known image generation applications, including image super-resolution [19, 55], image-to-image translation [45] and text-to-image transformation [47]. The authors in [54] proposed the self-attention GAN, named SAGAN, which adds a self-attention mechanism to the convolutions of convolutional GANs. This self-attention helps the network model multilevel dependencies over long-range image regions, which results in finer details being drawn in distant portions of the image. Additionally, its discriminator can enforce complex geometric constraints on the global image structure more accurately. In [29], the authors presented the idea of adding a least squares objective to GANs and formed an enhanced network called LSGAN. They proved through their experiments that reducing the cost of this objective function ultimately yields the minimization of the Pearson divergence. They tested their proposed LSGAN on the LSUN and CIFAR-10 benchmark datasets in terms of training stability and performance, and also applied it to a Chinese handwritten character dataset containing 3740 classes.

Moreover, the recent literature offers a new non-iterative approach for multiclass classification called the neural-like structure of the SGTM (successive geometric transformation model). The SGTM neural-like structure is more effective in terms of processing time on large datasets in classification tasks [48], and the processing time can be dramatically reduced by using a non-iterative greedy training approach in the high-speed SGTM neural-like structure. The mathematical and statistical proofs, the description of the training procedure and the formulation of the activation functions, along with its computational intelligence, are presented in [17]. In [49], the authors used the SGTM neural-like structure in a hybrid with the Kolmogorov–Gabor polynomial for solving regression problems on large datasets. According to the literature, [48] provides sufficient evidence of the effectiveness and significance of the non-iterative SGTM neural-like structure for solving different types of regression, classification and prediction problems.

As the study of image processing and classification has expanded over the last few decades, the performance of many deep learning (DL) approaches has been promising. One reason for the success of DL-based approaches is a subtype called the convolutional neural network (CNN), designed specifically for image classification problems [15]. A typical CNN model is a feed-forward neural network that consists of different layers, including convolutional layers, pooling layers and fully connected layers, which emphasize multidimensional correlation by connecting the neurons of adjacent layers. When every neuron of one layer is connected to all the neurons of the succeeding layer, the architecture is called a deep CNN (DCNN) [40]. DCNNs quickly captured the interest of researchers for solving image classification and processing problems. In [22], the authors used a DCNN in the ImageNet large scale visual recognition challenge in 2012; the winners of that challenge, with new record-breaking results, used a DCNN to classify approximately 1.2 million images into 1000 classes. Since then, subsequent variants of the DCNN have attracted the interest of many researchers for image classification.

The authors in [33] proposed a dimensionality reduction approach for image classification. Their cross-attention mechanism and graph convolution integration algorithm boosts the efficiency of hyperspectral data classification. PCA is used to reduce the dimensions of hyperspectral images and obtain more expressive low-dimensional features. The model employs a cross-attention strategy to jointly assign weights based on its two strategies and then employs a graph convolution algorithm to establish directional relationships between the features. Deep features, as well as the relationships between them, are used to predict hyperspectral data. Results on three hyperspectral datasets show that the proposed algorithm outperforms other algorithms under various training set division methods. Additionally, [9] presented a new dimensionality reduction method for face recognition called biomimetic uncorrelated locality discriminant projection (BULDP). It is based on unsupervised discriminant projection and two human bionic characteristics: the homology continuity and heterogeneous similarity principles. A new representation is proposed based on these two bionic features that can collect category information between different samples and represents the consistency between similar samples as well as the similarity between different samples. It can also transform the original space into an uncorrelated discriminant subspace using the new unsupervised discriminant projection [35]. Singular value decomposition provides a detailed solution to the BULDP, and a nonlinear version is developed using kernel functions for nonlinear dimensionality reduction. On four publicly accessible face recognition benchmarks, BULDP is compared with state-of-the-art methods, and the experimental results show that it achieves competitive recognition performance.

In the most recent literature, a large number of researchers are working on improvements to image classification. From an overview of existing research, it can be concluded that proper dimensionality reduction is one of the most important and indispensable steps in image classification [36]. Although deep learning approaches are considered among the better approaches for solving many multiclass classification problems, they still have a few weaknesses that encourage researchers to address them. The first weakness of these approaches is the lack of application-independent data dimensionality reduction based on user preferences. This shortcoming is reflected in the inability of the model to capture the useful information in the data, which degrades the quality of the classification solution [7]. Therefore, to overcome the dimensionality reduction problem associated with these models, many researchers have improved and hybridized the standard CAE for solving multiclass classification problems. [25] and [27] proposed stacked versions of the CAE in order to increase its dimensionality reduction capability. They mitigated this issue to some extent according to their research requirements, but at the cost of increased complexity of their proposed models.

3 Contractive Auto-encoder

The idea of the contractive auto-encoder (CAE) was first proposed in [38]. The CAE is a variant of the AE family that aims to be robust and insensitive to minor variations in the training data during the encoding process. It is based on an additional penalty term, the Frobenius norm of the Jacobian matrix of the encoder, added to the objective function. The effect of this modification is to penalize the sensitivity of the learned representation with respect to the input, which increases the robustness of the CAE toward the training data samples [2]. The CAE is commonly used in the same settings as other AE variants, particularly when there is noise in the input data, where other encoding algorithms often fail to classify the data points correctly. Both the denoising auto-encoder (DAE) and the CAE aim to bring robustness to the model, but their working mechanisms differ [10, 52]: the DAE injects noise into the data, whereas the CAE adds an analytic contractive penalty to the reconstruction error function [25, 27].

Although the CAE is considered one of the most powerful approaches for solving various types of multiclass classification problems, it has drawbacks that need proper attention in order to develop a technique that leads to problem-independent, high-quality solutions for these complex problems. Like other classification approaches, the CAE performs the classification task in three major stages, namely feature extraction, feature reduction and classification. The major drawback of the standard CAE is its lack of application-independent data dimensionality reduction according to user requirements. The consequence is that the CAE model cannot capture the finer details carrying useful information, which leads to low-quality solutions of the classification problem. The work carried out in this research targets this issue of the conventional CAE in solving multiclass classification problems in order to enhance its performance.

The whole processing inside the CAE takes place in two parts, namely encoding and decoding.

3.1 Encoding

Encoding is the process of mapping the input feature set into an intermediate representation in the hidden layer, given by Eq. 1:

$$\begin{aligned} y = f(x) = s_e (Wx + b_h) \end{aligned}$$
(1)

where f(x) represents the output of the input layer, which is given as input to the hidden layer, W represents the weights applied to the inputs and \(b_h\) represents the bias associated with the input feature set.

3.2 Decoding

Decoding is the process of mapping the output of the hidden layer back into the input feature space, given by Eq. 2:

$$\begin{aligned} r = g(y) = s_d (W'y + b_r) \end{aligned}$$
(2)

where g(y) represents the output of the hidden layer, \(W'\) represents the weights applied to the inputs of the hidden layer and \(b_r\) represents the bias of the reconstruction layer. In both the encoding and decoding functions, the weight and bias values are initialized randomly according to the number of nodes in the corresponding layer.

Here \(s_e\) and \(s_d\) are the encoding and decoding activation functions, respectively; common choices are the sigmoid function given by Eq. 3 and the hyperbolic tangent function given by Eq. 4:

$$\begin{aligned} sigmoid(x) = \frac{1}{(1 + e^{-x})} \end{aligned}$$
(3)
$$\begin{aligned} tanh(x) =\frac{ (e^x - e^{-x})}{(e^x + e^{-x})} \end{aligned}$$
(4)
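As a minimal illustration of Eqs. 1–4, the following NumPy sketch implements the encoding and decoding mappings; the layer sizes, the random initialization and the use of tied decoder weights are assumptions made for the example only.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 784, 128                      # input and hidden dimensions (assumed)
W = rng.normal(0.0, 0.01, (d_h, d_x))    # encoder weights
W_dec = W.T                              # tied decoder weights (a common choice)
b_h = np.zeros(d_h)                      # hidden-layer bias
b_r = np.zeros(d_x)                      # reconstruction bias

def sigmoid(z):                          # Eq. 3
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):                             # Eq. 4 (alternative activation)
    return np.tanh(z)

def encode(x):                           # Eq. 1: y = s_e(Wx + b_h)
    return sigmoid(W @ x + b_h)

def decode(y):                           # Eq. 2: r = s_d(W'y + b_r)
    return sigmoid(W_dec @ y + b_r)

x = rng.random(d_x)
r = decode(encode(x))                    # reconstruction has the same dimension as x
```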

The main aim of the reconstruction is to generate outputs as similar to the original inputs as possible by reducing the reconstruction error. The following parameter set is used by the reconstruction layer to reconstruct the original inputs:

$$\begin{aligned} \theta = [W, b_h, b_r] \end{aligned}$$
(5)

Suppose the input feature set is \(D_i = \{x_1, x_2, x_3, \ldots , x_n\}\); the reconstruction error is then minimized by minimizing the cost function presented in Eq. 6:

$$\begin{aligned} J_{AE}(\theta ) = \sum _{x \in D_i}R(x,r) \end{aligned}$$
(6)

where R is the reconstruction error: for a linear decoder it is the squared (Euclidean) error, whereas for a sigmoid decoder it is the cross-entropy loss. In order to avoid overfitting and to penalize the large weights that may arise from Eq. 6, a weight-decay term is added, yielding Eq. 7:

$$\begin{aligned} J_{AE+wd}(\theta ) = \sum _{x \in D_i}R(x,r) + \frac{\lambda }{2} ||W||_2^2 \end{aligned}$$
(7)

in which the relative importance of the regularization is controlled by the weight-decay coefficient \(\lambda \). Inspired by the goal of learning robust feature sets, the authors of [38] proposed the CAE with an unconventional regularization term, yielding the objective function presented in Eq. 8 based on Eqs. 6 and 7.

$$\begin{aligned} J_{CAE}(\theta ) = \sum _{x \in D_i}R(x,r) + \frac{\lambda }{2} ||J_f(x)||_F^2 \end{aligned}$$
(8)

In Eq. 8, \(J_f(x) = \partial f(x)/ \partial x \) is the Jacobian matrix of the encoder f at input x. Adding the penalty on the Frobenius norm of the encoder Jacobian encourages the feature mapping to be contractive in the local neighborhood of the training data, i.e., it yields intermediate feature representations that are robust to minor variations or noise in the input data.

In [37], the authors additionally provide experimental evidence that the trade-off between the CAE regularization term and the reconstruction error produces a representation that captures the local directions of variation dictated by the data. These directions often correspond to a lower-dimensional manifold, while the representation remains largely invariant to the majority of directions orthogonal to that manifold. For a sigmoid encoder, the contractive penalty term is easy to compute:

$$\begin{aligned} J_j(x)= & {} f(x)_j (1-f(x)_j)W_j \end{aligned}$$
(9)
$$\begin{aligned} ||J_f(x)||_{F}^{2}= & {} \sum _{j=1}^{d_h} (f(x)_j (1-f(x)_j))^2 ||W_j||^2 \end{aligned}$$
(10)

The computational cost of evaluating Eq. 10 is of the same order as that of a linear reconstruction error:

$$\begin{aligned} R(x,r) = ||x-b_r - \sum _{j=1}^{d_h} f(x)_j W_j ||^2 \end{aligned}$$
(11)

Hence, when the squared error is used, computing the objective and the gradient update of a CAE is only about twice as expensive as for a conventional AE, while the overall complexity of both is of the same order, namely \(O(d_h d_x)\). The step-by-step working mechanism and architecture of the conventional CAE are presented in Algorithm 1 and Fig. 1, respectively, and a small numerical sketch of the objective is given below.
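Building on the previous sketch, the following code shows how the contractive objective of Eq. 8 with the sigmoid-encoder penalty of Eqs. 9–10 can be evaluated for a single sample; the squared-error reconstruction, the tied weights and the value of \(\lambda \) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d_x, d_h, lam = 784, 128, 0.1                     # sizes and lambda (assumed)
W = rng.normal(0.0, 0.01, (d_h, d_x))
b_h, b_r = np.zeros(d_h), np.zeros(d_x)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cae_objective(x):
    h = sigmoid(W @ x + b_h)                      # f(x), Eq. 1
    r = sigmoid(W.T @ h + b_r)                    # reconstruction, Eq. 2 (tied weights)
    rec = np.sum((x - r) ** 2)                    # R(x, r), squared error
    # Eq. 10: ||J_f(x)||_F^2 = sum_j (h_j (1 - h_j))^2 * ||W_j||^2
    jac_norm = np.sum((h * (1.0 - h)) ** 2 * np.sum(W ** 2, axis=1))
    return rec + 0.5 * lam * jac_norm             # Eq. 8 for one sample

x = rng.random(d_x)
print("CAE objective for one sample:", cae_objective(x))
```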

Algorithm 1: Step-by-step working mechanism of the conventional CAE

Fig. 1
Standard CAE internal architecture

4 The Contractive Auto-encoder with Restricted Boltzmann Machine (CAE–RBM)

The proposed CAE–RBM model focuses on solving the first problem associated with the standard CAE, namely dimensionality reduction. We name the model CAE–RBM because the enhancement is based on the addition of restricted Boltzmann machine hidden layers. The model maps high-dimensional input data to a lower-dimensional representation, in which the encoder compresses the input, whereas the decoder reconstructs the original input from the lower-dimensional feature representation. A cross-entropy loss function quantifies the information loss derived from the deviation between the original input and the reconstructed output, and the goal of training is to minimize this loss. Because the target labels for the reconstruction are generated from the input data itself, the CAE–RBM can be regarded as self-supervised. In this model, RBM layers are added to the CAE in order to enhance its dimensionality reduction property for solving multiclass classification problems efficiently. The addition of RBM layers to the standard CAE yields a learning approach with two major properties: (a) it is robust toward small noise or changes in the input feature vector, and (b) it is capable of learning high-order statistical features from a high-dimensional feature vector thanks to the RBM layers. The proposed CAE–RBM therefore has the advantages of both the CAE and the RBM.

The basic operational procedure of the proposed model is organized around the training of the CAE layers and the RBM layers and proceeds in four stages. In the first stage, all the parameters associated with the CAE and RBM are initialized. The CAE parameters include the number of hidden layers, the input feature set, the encoding and decoding activation functions, the input weights and the bias values; the RBM parameters include the input feature set, the weight matrix and the visible- and hidden-layer bias vectors. After initializing these parameters, the CAE working mechanism starts in the second stage. The extracted feature vector is given as input to the CAE input layer; encoding occurs from the input to the hidden layer of the CAE, and decoding happens during the information transfer from the hidden layer to the output layer of the CAE, together with the computation of the reconstruction error. If the reconstruction error is below the threshold value, the information is propagated to the third stage. To find the threshold value that provides the best results, the model keeps updating the threshold in every epoch until the final accuracy no longer increases or decreases.

The third stage of the proposed CAE–RBM is based on the working procedure of the RBM, where the output of the CAE hidden layers is transferred to the RBM layers for learning high-order statistical features. The visible nodes of the RBM are represented by \(V_i\), with \(i \in \{1,2,3,\ldots ,x\}\), and the hidden-layer nodes by \(H_j\), with \(j \in \{1,2,3,\ldots ,y\}\). The weight connection between visible and hidden layers is represented by \(W_{ij}\). The weights for the RBM hidden-layer nodes are also initialized randomly in sequence based on the input from the CAE. The energy function of the RBM is:

$$\begin{aligned} E_{rbm}(v,h|\phi )= & {} -\sum _{i=1}^{x} a_i v_i -\sum _{j=1}^{y} b_j h_j - \sum _{i=1}^{x} \sum _{j=1}^{y} w_{ij} v_i h_j \end{aligned}$$
(12)

In Eq. 12, \(\phi = (W_{ij}, a_i, b_j)\), where all parameters are real numbers; \(b_j\) denotes the bias of the hidden layer and \(a_i\) the bias of the visible layer. The likelihood function, i.e., the joint probability distribution of the visible and hidden layers, is formulated in Eq. 13 when \(\phi \) is known:

$$\begin{aligned} P(v,h|\phi ) = \frac{1}{Z} \exp (-E(v,h|\phi )) \end{aligned}$$
(13)

where the normalization constant is \(Z = \sum _{v,h} \exp (-E(v,h|\phi )) \). The activation probabilities of the hidden and visible layers are given by Eqs. 14 and 15, respectively:

$$\begin{aligned} P(h_j = 1|v) = sigmoid \left( \sum _{i=1}^{x} w_{ij} v_i + b_j\right) \end{aligned}$$
(14)
$$\begin{aligned} P(v_i = 1|h) = sigmoid \left( \sum _{j=1}^{y} w_{ij} h_j + a_i\right) \end{aligned}$$
(15)

where \(sigmoid(x) = \frac{1}{1 + e^{-x}}\). The probability that the \(j\)th hidden node is activated is given by Eq. 14, while the activation probability of the \(i\)th visible node is given by Eq. 15. Finding the best possible parameter set \(\phi \) is the main purpose of RBM training, so that the trained model fits the input data as well as possible. The best value of \(\phi \) is obtained by maximizing the log-likelihood over the training samples \(T_r = \{v^{(1)}, v^{(2)}, v^{(3)}, \ldots , v^{(T_r)}\}\). The outputs of the RBM hidden layers then proceed to the fourth stage, where the output feature vector from the RBM layers is given as input to the softmax layer for the final classification. The complete structure of the proposed model is shown in Fig. 2 and explained in Algorithm 2, while the internal layered architecture is presented in Fig. 3.
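The RBM quantities of Eqs. 12–15 can be sketched as follows, together with one step of contrastive-divergence (CD-1) training; the use of CD-1, the learning rate and the layer sizes are illustrative assumptions rather than details taken from the proposed model.

```python
import numpy as np

rng = np.random.default_rng(0)
x_dim, y_dim, lr = 128, 64, 0.01          # visible units x, hidden units y (assumed)
W = rng.normal(0.0, 0.01, (x_dim, y_dim)) # w_ij
a = np.zeros(x_dim)                       # visible biases a_i
b = np.zeros(y_dim)                       # hidden biases b_j

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def energy(v, h):                         # Eq. 12
    return -v @ a - h @ b - v @ W @ h

def p_h_given_v(v):                       # Eq. 14
    return sigmoid(v @ W + b)

def p_v_given_h(h):                       # Eq. 15
    return sigmoid(W @ h + a)

def cd1_step(v0):
    """One contrastive-divergence step approximating the log-likelihood gradient."""
    ph0 = p_h_given_v(v0)
    h0 = (rng.random(y_dim) < ph0).astype(float)   # sample hidden state
    v1 = p_v_given_h(h0)                           # mean-field reconstruction
    ph1 = p_h_given_v(v1)
    W[:] += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    a[:] += lr * (v0 - v1)
    b[:] += lr * (ph0 - ph1)

v = rng.random(x_dim)                     # e.g., a feature vector from the CAE hidden layer
cd1_step(v)
```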

Algorithm 2: Working procedure of the proposed CAE–RBM

Fig. 2
CAE–RBM working flow

Fig. 3
Architecture of the proposed CAE–RBM model

5 Experimental Results

In this section, we present the experiments conducted to evaluate the feature learning ability of the proposed and comparative models toward better solutions for multiclass classification problems. All experiments are carried out on an Intel Core i5 CPU with 8 GB of RAM running the Windows 10 operating system. Python 3.6 is used as the language for developing and testing the algorithms. For quick implementation of the proposed and comparative approaches, the efficient open-source numerical computation library TensorFlow [5] is used, which allows simple and fast development with both CPU and GPU support. Table 1 presents a summary of the parameter settings for the proposed and comparative approaches.

We use four variants of the benchmark MNIST dataset for evaluation and comparison. MNIST is a handwritten digit image dataset that contains 70,000 images of the digits 0 to 9, each of size 28x28 pixels. In the conducted experiments, 12,000 images are considered for evaluation and comparison with the comparative models, and the experiments are performed with a 70:30 ratio of training to testing data, as sketched below. Six random images from each MNIST variant dataset are shown in Fig. 4.
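A minimal sketch of this data preparation is given below; the use of tensorflow.keras.datasets.mnist for the basic MNIST digits and the random seed are assumptions, and the rotated and noisy-background variants would need to be loaded from their own files.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.datasets import mnist

# Load the basic MNIST digits (70,000 images of size 28x28, labels 0-9).
(x_a, y_a), (x_b, y_b) = mnist.load_data()
X = np.concatenate([x_a, x_b]).reshape(-1, 28 * 28).astype("float32") / 255.0
y = np.concatenate([y_a, y_b])

# Keep a random subset of 12,000 images and split it 70:30, as in the experiments.
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=12000, replace=False)
X_train, X_test, y_train, y_test = train_test_split(
    X[idx], y[idx], test_size=0.3, random_state=0, stratify=y[idx]
)
print(X_train.shape, X_test.shape)   # (8400, 784) (3600, 784)
```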

5.1 CAE–RBM-Based Feature Reduction

The feature reduction capability of the CAE–RBM model on all datasets is presented in this section. The main role of the CAE in multiclass classification is the dimension reduction of the feature space, while the RBM performs better in learning high-order statistical features; merging both in the developed CAE–RBM enriches the model with the advantages of each. The output of the feature reduction experiments of the developed CAE–RBM is presented in Fig. 5: the first row contains the original images, the second row shows the encoded feature images, and the third row contains the images decoded from the reduced feature set, as sketched below. The information loss during dimensionality reduction is given in Table 2.
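The sketch below indicates how a figure such as Fig. 5 can be produced; it assumes that trained encode and decode functions (as in the earlier sketches) are available and that the encoded vector can be reshaped into a small rectangular image, both of which are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

def show_reduction(images, encode, decode, code_shape=(8, 16)):
    """Plot original, encoded and decoded versions of a few 28x28 images."""
    n = len(images)
    fig, axes = plt.subplots(3, n, figsize=(1.2 * n, 4), squeeze=False)
    for col, img in enumerate(images):
        code = encode(img.ravel())                             # reduced feature vector
        recon = decode(code)                                   # reconstruction from the code
        axes[0][col].imshow(img.reshape(28, 28), cmap="gray")
        axes[1][col].imshow(np.asarray(code).reshape(code_shape), cmap="gray")
        axes[2][col].imshow(np.asarray(recon).reshape(28, 28), cmap="gray")
    for row, label in zip(axes[:, 0], ["original", "encoded", "decoded"]):
        row.set_ylabel(label)
    for ax in axes.ravel():
        ax.set_xticks([]); ax.set_yticks([])
    plt.show()
```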

Table 1 Parameter tuning of the proposed and comparative models
Fig. 4

Sample images from different MNIST variant datasets

Fig. 5

CAE–RBM-based encoded and decoded sample images from different MNIST variant datasets

5.2 CAE–RBM Measures

This section presents the experimental results of the proposed CAE–RBM model in terms of the confusion matrix and ROC curve, which give detailed results of the correctly classified and misclassified instances at class level. The results of the proposed model on the four aforementioned datasets are shown in Figs. 6, 7, 8 and 9. In these experiments, the core CAE–RBM model uses a softmax classifier to determine the overall classification behavior on the image data. All these figures present the confusion matrix and ROC curve for the 70:30 training–testing ratio; a sketch of how they are computed is given below. The highest accuracy is observed for class 1, followed by classes 6 and 0, because their features are more distinct compared to the other classes. On the other hand, the lowest accuracy is seen for class 8, followed by classes 9 and 5; these classes are mainly misclassified as one another because of the high similarity of their features.
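A hedged sketch of how the class-level confusion matrix and one-vs-rest ROC curves can be computed from the softmax outputs is given below; the names y_test and probs are assumed placeholders for the true labels and the predicted class probabilities (shape: instances x 10).

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, confusion_matrix, roc_curve
from sklearn.preprocessing import label_binarize

def report_class_level(y_test, probs, n_classes=10):
    y_pred = probs.argmax(axis=1)
    print(confusion_matrix(y_test, y_pred))        # rows: true class, cols: predicted class

    # One-vs-rest ROC curve per class from the softmax probabilities.
    y_bin = label_binarize(y_test, classes=range(n_classes))
    for c in range(n_classes):
        fpr, tpr, _ = roc_curve(y_bin[:, c], probs[:, c])
        plt.plot(fpr, tpr, label=f"class {c} (AUC={auc(fpr, tpr):.2f})")
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    plt.show()
```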

6 Comparative Analysis

The comparative analysis of the proposed CAE–RBM with different state-of-the-art multiclass classification models, including the ANN, standard CAE and standard RBM, is presented here. In this comparison, accuracy and complexity are used as performance evaluation attributes: testing accuracy and precision/recall are used to assess accuracy, whereas execution time together with big-O notation is used for complexity. In image processing, most of the time is consumed by image representation learning. We reduce the high-dimensional features to low-dimensional feature vectors to efficiently lower the time complexity of the proposed CAE–RBM. It can be reduced further by increasing the memory capacity, because memory consumption is directly proportional to representation learning; therefore, extending the memory will increase the representation learning speed.

6.1 Accuracy

All of the experimented models are trained and tested using the same architecture for the pre-classification steps. Table 3 lists all models with their training accuracy rates. In order to obtain fair results, the experiments for each model have been repeated five times; all the testing accuracy results of the proposed CAE–RBM and the state-of-the-art models discussed in Table 4 are the mean values of these repeated experiments, computed as sketched below. In addition to the training and testing accuracies, Table 5 provides a more explanatory performance evaluation based on the precision and recall values of the proposed and comparative models. The final results show that the CAE–RBM with softmax classifier outperforms the ANN, CAE and RBM.
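The following sketch indicates how the mean testing accuracy and macro-averaged precision/recall over five repeated runs can be computed; train_and_predict is a hypothetical stand-in for training any of the compared models with a given random seed.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

def evaluate_over_runs(train_and_predict, X_tr, y_tr, X_te, y_te, runs=5):
    accs, precs, recs = [], [], []
    for seed in range(runs):
        y_pred = train_and_predict(X_tr, y_tr, X_te, seed=seed)   # hypothetical trainer
        accs.append(accuracy_score(y_te, y_pred))
        precs.append(precision_score(y_te, y_pred, average="macro"))
        recs.append(recall_score(y_te, y_pred, average="macro"))
    return np.mean(accs), np.mean(precs), np.mean(recs)           # means over the runs
```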

Table 2 Information loss during CAE–RBM dimensionality reduction
Fig. 6

CM and ROC for MNIST small subset (basic)

Fig. 7

CM and ROC for MNIST random rotation digits (rot)

Fig. 8

CM and ROC for MNIST random noise background digits (bg-rand)

Fig. 9

CM and ROC for MNIST random background digits (bg-img)

6.2 Time Complexity

Apart from the different computational behavior of the algorithms, all experiments were carried out on the same hardware and software architecture mentioned earlier. The complexity of a neural network depends on its architecture, the number of layers and the number of nodes per layer. In the conducted experiments, a four-hidden-layer architecture is considered. Computing the activations of all nodes in a layer L requires \({\mathcal {O}}(L(n^2))\), where n is the number of features at layer L. The final complexity of the ANN in the conducted experiments is therefore \({\mathcal {O}}(4n^2)\) and that of AlexNet \({\mathcal {O}}(4n^3)\). The standard RBM has a runtime complexity of \({\mathcal {O}}(n^2)\) for a single hidden layer [46]; the experiments follow the same architecture of four hidden layers, so the overall complexity of the RBM becomes \({\mathcal {O}}(4n^2)\), where n is the number of features in each layer. In a conventional CAE, the complexity is split between the encoding and decoding functions: calculating the compressed layer costs \({\mathcal {O}}(xn)\) and decoding requires \({\mathcal {O}}(xh)\) runtime. In this case, the overall complexity of a four-layered CAE becomes \({\mathcal {O}}(4(xn + xh))\), where x is the number of instances, n is the number of encoded features and h is the number of decoded features.

In the proposed model, the advantages of the CAE and RBM are merged, but since the internal architecture is kept the same as for the comparative models, the complexity does not increase as in other hybridized models. The computational runtime of the CAE–RBM layers is \({\mathcal {O}}(4(xn + xh^2))\), where x is the number of instances, n is the number of features in the CAE hidden layer and h is the number of features in the RBM hidden layer.

Table 6 summarizes the runtimes of the different competitive models from the literature on the MNIST variant datasets. It clearly shows that the time taken by all of the considered models and algorithms, i.e., the ANN, RBM, AlexNet and CAE with four hidden layers, is greater than that of the CAE–RBM. The time taken by the CAE–RBM is lower because of the randomization factor, which speeds up training without increasing the computational complexity of the model. Although the difference between the time consumed by the CAE–RBM and the other approaches is small, taking accuracy and time complexity together the CAE–RBM presents better performance. Overall, the proposed model outperformed the state-of-the-art models with the same architecture and parameter tuning.

Table 3 Training accuracy of different models on MNIST variant
Table 4 Testing accuracy of different models with softmax classifier on MNIST variants
Table 5 Precision and recall of different models on MNIST variants
Table 6 Execution time of different models on MNIST variant

7 Conclusion

The main focus of this paper is to propose a better approach for multiclass classification. We presented the CAE based on the RBM (CAE–RBM) and evaluated the performance of the proposed model. We use the CAE for dimensionality reduction and the RBM for learning high-order statistical features. The output of the developed model is compared analytically with existing state-of-the-art classification algorithms, including the standard CAE, RBM, AlexNet and ANN. The results presented in the figures and tables are based on four benchmark MNIST variant datasets. In these experiments, 12,000 images are randomly selected from each of the four benchmark datasets, and the experiments are performed with a 70% training and 30% testing ratio for each dataset. The performance is evaluated in terms of accuracy and time complexity. In order to validate the performance of the proposed model, the class-level classification results are presented in the form of confusion matrices and ROC curves. The experimental outputs show a minor decrease in accuracy from the MNIST basic dataset to the most complex MNIST random background digits dataset; this gradual increase in dataset complexity is directly related to the decrease in accuracy. On the other hand, the proposed model does not consider the space complexity of the algorithm for less complex and small-sized datasets. The CAE–RBM adds functionality from the RBM to the standard CAE, and since both algorithms have different working procedures for processing information before the final classification, merging the concepts of one algorithm into another involves many technical steps. In the CAE–RBM, the RBM layers are merged with the CAE layers, and during this process the complex computational steps of both the CAE and the RBM have not been removed, which may increase the space complexity of the model. In future work, we will proceed with reducing the space complexity of the proposed model and perform experiments on Big Data applications.