1 Background and context

Auto-encoders (AEs) are unsupervised neural networks trained with backpropagation: they map a high-dimensional input feature set to a low-dimensional representation and then recover the original feature set from that representation [1]. The reduction of high-dimensional data to a low-dimensional representation is known as encoding, while the reconstruction of the original data from the low-dimensional representation is called decoding. The auto-encoder (AE) was proposed to improve the reconstruction reliability of low-dimensional feature sets. AEs have the following main sub-models: sparse auto-encoders (SAEs) [2], denoising auto-encoders (DAEs) [3], the Laplacian regularized auto-encoder (LAE) [4, 5], the coupled deep auto-encoder (CDA) [6], Hessian regularized sparse auto-encoders (HSAE) [7], nonnegativity-constrained auto-encoders (NCAE) [8], the multimodal deep auto-encoder (MDA) [9], the Bayesian auto-encoder (BAE) [10] and contractive auto-encoders (CAEs) [11].

In SAE, a sparsity factor is added to the original input nodes. There are two major ways to impose sparsity in AEs: penalizing the biases of the hidden layers, or penalizing the outputs of the hidden layers directly. When solving numerical optimization problems, the weights are compensated by this penalty bias, which degrades the performance and efficiency of the algorithm; this is one of the important drawbacks of conventional AEs and deserves attention. For reconstructing the original data from noisy inputs, the authors in [12] proposed DAEs. DAEs serve two purposes: encoding the noisy data, and recovering the original input data from the reconstructed output. When data encoders are stacked in different layers, they form stacked DAEs, which use an extra layer to minimize the classification error. Stacked AEs are so named because conventional AEs are placed in a stack, as highlighted by [13]; in this architecture, however, the layers are arranged sequentially rather than in parallel.

In order to enhance the performance of state-of-the-art AEs, many researchers have presented extensions to the AE. For example, to preserve the locality of data points in AEs, the author of LAE modified the standard AE by adding a Laplacian regularization penalty [5]. Similarly, in [7] Hessian regularization was added to the SAE to preserve the local structure of data points during learning, forming HSAE, which improves robustness to noise while keeping the sparsity constraint. In [14] the author added non-negativity constraints on the weight matrices of the SAE to form NCAE, in order to improve the reconstruction of the input features and enhance the ability to disentangle the hidden geometry of the input. In [15] the authors presented MDA, which comprises three stages: the first stage is an AE that learns the internal structure of data points for 2D images; the second stage is a two-layer neural network that transforms 2D images into a 3D representation; and the third stage is again an AE that learns the hidden data points for 3D poses. Furthermore, CDA proposed a big-data-driven architecture comprising two auto-encoders that learn intrinsic features by extracting the hidden feature set from low-resolution and high-resolution image patches [6]. In [16] the author applied a Bayesian network to the standard auto-encoder to build a multi-layered BayesNet called the adversarial variational Bayes auto-encoder (BAE). The main purpose of the BayesNet architecture in BAE is to adjust conditional probabilities for better prediction; BAE operates like a BayesNet and therefore learns features using belief propagation. The authors in [17] introduced a multimodal video classification framework with a two-stage training architecture that learns a set of mapped latent features capturing both intra-modal and inter-modal semantics. In the first stage they trained separate SCAEs on three different feature vectors (audio, video and text) extracted from the video, and in the second stage they combined all modalities to learn a multimodal stacked contractive auto-encoder (MSCAE). To increase the robustness of the standard AE, researchers proposed the contractive auto-encoder (CAE). CAEs can be considered an extension of DAEs in which a contractive penalty is added to the reconstruction error function; this penalty penalizes the sensitivity of the learned features to variations in the input. The major drawback of the CAE penalty is that it only accounts for minuscule variations of the input data values. The authors in [18] addressed this issue but did not fully resolve it, and there is still much room for improvement.

2 Related work

In advanced classification methods and models, the interconnected network of artificial neurons, called an artificial neural network, is of high interest among researchers [19,20,21]. Each neuron in the network describes a feature, and deeper layers of the network represent more essential features than the preceding layers. An excessive number of features results in a complex network [22, 23]; therefore, researchers introduced AEs for reducing the feature set from a high dimensionality to a lower one. In such AE-based feature reduction models, however, informative features can be lost. To address this issue, many researchers have proposed variants of AEs in the literature. These AE variants are used in different classification domains. For example, in [13] the authors used a sparse multi-layered auto-encoder framework for automatic detection of nuclei on a set of 537 annotated histopathological breast cancer images. The input data was split into two subgroups: training (37 images) and testing (500 images). The whole architecture consists of one input layer, two hidden layers and one output layer. They stacked the sparse auto-encoder and used softmax for classification. There were 3468 nodes in the input layer, 400 nodes in the first hidden layer and 255 nodes in the second hidden layer. The output of the second hidden layer was fed to the final softmax layer, which maps to two classes, 1 or 0. They further performed a comparative analysis of the stacked sparse auto-encoder against color convolution, convolutional neural network and expectation-maximization based nuclei auto-detection.

The authors in [12] added a weighted reconstruction loss function to the conventional DAE for noise classification in a speech enhancement system. They stacked several weighted DAEs to construct the model. In their experiments, they performed 50 steps with the number of input nodes ranging from 50 to 100. Clean data comprising 8 languages and white noise at SNRs of 6, 12 and 18 dB were selected from the NTT database for model training. Their model was trained on 1 hour of data and tested on 8 minutes of data. The author in [4] applied a regularized feature learning framework to the AE in order to enhance the locality preservation of input data points on the manifold during encoding, and named it LAE. They used the benchmark datasets MNIST and CIFAR-10 for recognition of hand-written digits and object recognition, respectively. For MNIST they used 50,000 images for training, 10,000 for validation and 10,000 for testing. For CIFAR-10, they used 50,000 images for training the object recognition model and 10,000 for testing. There were 3468 input nodes to the model and 10 output classes.

In [6], the authors proposed CDA, based on a single architecture that simultaneously learns the intrinsic representations of low-resolution and high-resolution image patches for single-image super-resolution. This research used two datasets for validation, Multi-PIE and the CUHK Face Sketch FERET dataset. All images in both datasets were resized to 64 × 80 pixels before preprocessing. Seven comparative experiments were conducted on the Multi-PIE dataset to evaluate face recognition under different poses, and eight experiments on the CUHK Face Sketch FERET dataset for drawing sketches with shape exaggeration.

Liu et al. [7] proposed Hessian regularized sparse auto-encoders (HSAE) to enhance the internal geometry of the auto-encoder and its robustness to noise using Hessian regularization and a sparsity constraint, respectively. They also stacked HSAEs to form a deep architecture for better performance and performed classification experiments on the MNIST and CIFAR-10 benchmark datasets. The number of input features is 1024 and the output layer has 10 classes. The performance evaluation metric was classification accuracy, and HSAE was found to outperform the basic auto-encoder, Laplacian auto-encoder, sparse auto-encoder and Hessian auto-encoder on the same datasets.

Chorowski et al. [8] applied a non-negative weight constraint to the basic auto-encoder to form a deep architecture called NCAE. The constraint was applied first in the unsupervised training phase of the auto-encoder and then in the fine-tuning stage. They conducted classification experiments on the MNIST hand-written digits, the small NORB object image set, the Reuters-21578 text corpus and the ORL image benchmark datasets. Reconstruction error, sparseness of the hidden encoding and part-based representation of features in the unsupervised learning phase were used as performance metrics. The experiments showed that NCAE outperformed the nonnegative sparse auto-encoder (NNSAE), dropout auto-encoder (DpAE), denoising auto-encoder (DAE) and sparse auto-encoder (SAE).

In addition, the authors in [17] applied CAEs to video semantics classification by introducing a two-phase learning framework based on the CAE. To learn a discriminative feature set by stacking CAEs, moving from single-modal pre-training to multi-modal fine-tuning, they emphasized both inter-modal and intra-modal semantics. To validate their model, they set up all comparative models with the same single-modal pre-training and multi-modal fine-tuning as their own multi-modal SCAE. In the first phase, the image representation was reduced from 546 to 128 dimensions, the text representation from 1285 to 128 and the audio representation from 38 to 20, using a two-layer architecture. In the second phase, the concatenated features were jointly reduced from 276 to 128, then to 32 in the third layer and finally to 16. The final evaluation showed that the multi-modal SCAE outperformed the support vector machine (SVM) and linear discriminant analysis (LDA) [24] under 10-fold cross-validation.

3 The proposed model

Feature reduction is one of the most important steps in solving big data problems and in dealing with complex, high-dimensional data. Feature reduction is a valuable method for any classification and prediction model [25, 26]: it separates the useful and effective features from the ineffective and useless ones within the feature representation space [27]. However, if irrelevant raw features are given as input, feature reduction may fail and classification becomes inefficient. In the proposed deep contractive auto-encoder (D-CAE), three CAEs are used for feature reduction. The CAEs are arranged in layers and trained in a feedforward manner to minimize the objective function and hence the reconstruction error; minimizing the objective function and the reconstruction error in turn reduces the classification error. This section is divided into three phases. The first phase explains the working mechanism of the auto-encoder, the second explains the operational steps of the contractive auto-encoder, and the last presents the overall workflow of the proposed three-layer deep contractive auto-encoder.


The internal structure of a conventional auto-encoder is similar to a standard three-layer neural network. The auto-encoder's processing takes place in two parts: encoding and decoding. Encoding and decoding take place in every layer of the proposed D-CAE, i.e. layer 1, layer 2 and layer 3. In order to clarify the structure of each layer, only the first layer is highlighted; the succeeding layers follow the same architecture, as shown in Fig. 1.

Fig. 1 Architecture of the proposed model

3.1 Encoding

The process of mapping the input feature set into an intermediate representation in the hidden layer is called encoding. Mathematically, encoding is given by Eq. 1.

$$\begin{aligned} y = f(x) = \phi (Wx + b_h) \end{aligned}$$
(1)

where y = f(x) is the output of the input layer that is fed to the hidden layer h, \(\phi \) is the encoding activation function, W is the weight matrix applied to the inputs and \(b_h\) is the bias associated with the input feature set.

3.2 Decoding

The process of mapping the output of the hidden layer back into the input feature set is called decoding. Mathematically the process of decoding is given by Eq. 2.

$$\begin{aligned} r = g(y) = \phi ^o (Wy + b_r) \end{aligned}$$
(2)

where g(y) is the reconstruction produced from the hidden layer output y, W is the weight matrix applied to the hidden layer outputs and \(b_r\) is the bias of the reconstruction layer.

\(\phi \) and \(\phi ^o\) are the encoding and decoding activation functions, respectively; they are given by Eq. 3 for the nonlinear representation (sigmoid function) and by Eq. 4 for the linear representation (hyperbolic tangent function).

$$\begin{aligned}&Sigmoid(x) = 1/(1+e^{-x}) \end{aligned}$$
(3)
$$\begin{aligned}&\tanh (x) = (e^x - e^{-x})/(e^x + e^{-x}) \end{aligned}$$
(4)
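The encoding and decoding steps of Eqs. 1-4 can be illustrated with a short NumPy sketch. The layer sizes, the random initialization and the use of tied decoder weights (the decoder reuses \(W^T\)) are assumptions made only for this illustration; they are not prescribed by the equations above.

```python
import numpy as np

def sigmoid(x):
    # Eq. 3: nonlinear activation
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative encode/decode pass for a single layer (Eqs. 1 and 2).
rng = np.random.default_rng(0)
n_inputs, n_hidden = 784, 128                     # e.g. a 28 x 28 image, flattened
W   = rng.normal(0, 0.01, (n_hidden, n_inputs))   # weight matrix
b_h = np.zeros(n_hidden)                          # hidden-layer bias
b_r = np.zeros(n_inputs)                          # reconstruction-layer bias

x = rng.random(n_inputs)                          # one input feature vector
y = sigmoid(W @ x + b_h)                          # encoding, Eq. 1
r = sigmoid(W.T @ y + b_r)                        # decoding, Eq. 2 (tied weights)

reconstruction_error = np.mean((x - r) ** 2)
```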

The major aim of reconstruction is to generate outputs as similar to the original inputs as possible by reducing the reconstruction error. The following parameter set is used by the reconstruction layer to reconstruct the original inputs.

$$\begin{aligned} \Theta = \left[ W, b_h, b_r \right] \end{aligned}$$
(5)

Suppose we have the input feature set \(D_i = [x_1, x_2, x_3, \ldots , x_n]\); then the reconstruction error is minimized by minimizing the following cost function:

$$\begin{aligned} J_{AE}(\Theta ) = \sum _{x \in D_i} R(x,r) \end{aligned}$$
(6)

where R is the reconstruction error; for a linear representation it is the Euclidean distance, whereas for a nonlinear representation it is the cross-entropy loss. Since minimization takes place in the decoding phase of a conventional CAE, it is applied to all three layers of D-CAE. In order to avoid overfitting and to penalize the large weights arising from Eq. 6, Eq. 6 is extended with a weight decay term, giving Eq. 7:

$$\begin{aligned} J_{AE\text{-}wd}(\theta ) = \sum _{x \in D_i} R(x,r) + \frac{\lambda }{2} \left\| W \right\| _2^2 \end{aligned}$$
(7)

where the relative importance of the regularization is controlled by the weight decay coefficient \(\lambda \). Based on Eqs. 6 and 7, the objective of the contractive auto-encoder becomes Eq. 8.

$$\begin{aligned} J_{CAE}(\theta ) = \sum _{x \in D_i} R(x,r) + \frac{\lambda }{2} \left\| J_f(x) \right\| _F^2 \end{aligned}$$
(8)

where \(J_f(x)\) is the Jacobian matrix of the encoder f at x and \(\left\| \cdot \right\| _F\) is the Frobenius norm. In the case of D-CAE, Eq. 8 is formulated for the n stacked CAEs as Eq. 9.

$$\begin{aligned} J_{D\text{-}CAE}(\theta ) = \sum _{i=1}^{n} \left[ \sum _{x \in D_i} R(x,r) + \frac{\lambda }{2} \left\| J_{f_i}(x) \right\| _F^2 \right] \end{aligned}$$
(9)
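For a sigmoid encoder, the Frobenius norm of the Jacobian in Eq. 8 has a simple closed form, which the following NumPy sketch makes concrete. The tied decoder weights and the squared-error reconstruction loss are assumptions made only for this illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cae_objective(x, W, b_h, b_r, lam):
    """Per-example CAE objective of Eq. 8 for a sigmoid encoder/decoder.

    For a sigmoid encoder, ||J_f(x)||_F^2 has the closed form
    sum_j (h_j (1 - h_j))^2 * sum_i W_ji^2.
    Tied decoder weights (W^T) are an assumption of this sketch.
    """
    h = sigmoid(W @ x + b_h)                 # encoding, Eq. 1
    r = sigmoid(W.T @ h + b_r)               # decoding, Eq. 2
    recon = np.sum((x - r) ** 2)             # squared-error reconstruction loss
    jac_frob = np.sum((h * (1.0 - h)) ** 2 * np.sum(W ** 2, axis=1))
    return recon + 0.5 * lam * jac_frob      # Eq. 8
```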

The complete structure of the conventional AE is given in Algorithm 1, while the algorithmic structure of the proposed model is shown in Algorithm 2. The overall workflow and network architecture of the proposed model is shown in Fig. 1. There are three layers in the proposed model, and each layer is itself a single CAE. In Fig. 1, the internal architecture of the first layer is divided into an encoder and a decoder. Layers 2 and 3 are also CAEs with the same architecture as layer 1, followed by a final softmax layer for classification. In order to minimize the reconstruction error, the optimization based on Eq. 6 is applied to each CAE layer, as sketched below.
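The following sketch shows how the three CAE layers could be chained in the feedforward manner described above. The per-layer optimizer `train_cae`, the hidden sizes and the sigmoid encoding are placeholders; Algorithm 2 itself is not reproduced here.

```python
import numpy as np

def stack_dcae(X, layer_sizes, train_cae):
    """Greedy, layer-wise chaining of the CAE layers (a sketch).

    X           : (n_samples, n_features) input matrix
    layer_sizes : hidden width of each CAE layer (hypothetical values)
    train_cae   : callable returning (W, b_h) by minimising Eq. 8 per layer
    """
    codes, encoders = X, []
    for n_hidden in layer_sizes:
        W, b_h = train_cae(codes, n_hidden)           # per-layer optimisation
        encoders.append((W, b_h))
        codes = 1.0 / (1.0 + np.exp(-(codes @ W.T + b_h)))  # Eq. 1 encoding
    return encoders, codes        # final codes feed the softmax layer
```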

4 Experimental results

In this section, we present the experiments for the proposed D-CAE and the comparative models. All models considered in these experiments are evaluated on 5 benchmark variant datasets of MNIST. 5000 images from each MNIST variant dataset are randomly selected to validate the proposed D-CAE model. The MNIST variant datasets are: small subset (basic), random rotation digits (rot), random noise background digits (bg-rand), random background digits (bg-img) and rotation and image background digits (bg-img-rot) [28]. Six random images from each MNIST variant dataset are shown in Figs. 2, 3, 4, 5 and 6, respectively. The experiments were performed on all of the above datasets in two phases in order to validate D-CAE with different ratios of training and testing data. In the first phase, the data was split into 70% training and 30% testing; in the second phase, 50% of the data was used for training and 50% for testing. The output of the experiments is reported in the form of confusion matrices (CM) and receiver operating characteristic (ROC) curves.

Fig. 2 MNIST small subset (basic)

Fig. 3 MNIST random rotation digits (rot)

Fig. 4 MNIST random noise background digits (bg-rand)

Fig. 5 MNIST random background digits (bg-img)

Fig. 6 MNIST rotation and image background digits (bg-img-rot)

4.1 Training procedure

This section describes the training method and heuristics of the D-CAE model in detail. The D-CAE consists of three layers, namely the input layer, the hidden layer and the reconstruction layer. The input layer takes the whole input feature set, the hidden layer performs the internal processing of the auto-encoder, and the reconstruction layer maps the target output back to the fed inputs. The feature set obtained from the reconstruction layer is then given as input to the classifier (a softmax layer) for classification. The reconstruction of the inputs works in an unsupervised manner whereas the classification works in a supervised fashion. We train our model using a feedforward mechanism. There are a few parameters that we adjust for every variant of the MNIST dataset. In image processing, it is costly and memory-consuming for an algorithm to operate over all image pixels at once; for this reason it is more convenient to split the image into patches. A single patch is a rectangular or square piece of an image; for example, a 10 \(\times \) 10 patch contains 100 pixels. The second technique applied is filtering, which refers to any enhancement or modification of the image for the purpose of emphasizing some features or removing others. These filters replace each pixel's intensity value by a weighted mean of the neighboring pixels. The third parameter of the proposed approach is the aperture. Generally, the aperture size refers to the intensity resolution and sensitivity of the image. As the aperture size decreases, the intensity resolution improves and the sensitivity of the image decreases, which ultimately reduces the image processing cost. These parameters, the patch size, the number of layers and the filters, were determined using 10-fold cross-validation. The parameter values for each dataset variant are shown in Table 1. During our experiments, we kept the patch shape square, and for learning filters we randomly selected 1000 patches for each layer. Filter learning is a crucial preprocessing step and can be expressed by Eq. 10.

$$\begin{aligned} \left\{ \begin{array}{l} Z_{i} = U_{i}\, \Sigma _{i}\, V_{i}^{T} \\ U_{i} = [u_{1}, u_{2}, u_{3}, \ldots , u_{p^{2}}] \\ W_{i} = [u_{1}, u_{2}, u_{3}, \ldots , u_{k}] \end{array} \right. \end{aligned}$$
(10)

In Eq. 10, \(Z_i \in R^{p^2 \times 1000}\) denotes the patch matrix collecting 1000 patches as vectors for the ith layer, where \(i=1,2,3\). \(V_i \in R^{1000 \times 1000}\) and \(U_i \in R^{p^2 \times p^2}\) denote unitary matrices, and \(W_i \in R^{p^2 \times k}\) with \(k < p^2\) denotes the matrix collecting the filters, where k is the number of filters. All the filtered features are normalized to variance 1 and mean 0. The aperture for each extracted matrix is adjusted empirically on the interval [0, 3], with the value determined by cross-validation. The parameters for each MNIST variant dataset are shown in Table 1.
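A NumPy sketch of this SVD-based filter learning is given below. The patch sampling strategy, the zero-mean step and the function signature are assumptions for illustration; the exact preprocessing (including the aperture adjustment) is only summarized above.

```python
import numpy as np

def learn_filters(images, p, k, n_patches=1000, seed=0):
    """Sketch of the filter-learning step of Eq. 10.

    Samples `n_patches` p x p patches, stacks them as columns of Z
    (p^2 x n_patches), computes the SVD Z = U S V^T and keeps the first
    k left singular vectors as the filter matrix W_i (p^2 x k).
    """
    rng = np.random.default_rng(seed)
    patches = []
    for _ in range(n_patches):
        img = images[rng.integers(len(images))]
        r = rng.integers(0, img.shape[0] - p + 1)
        c = rng.integers(0, img.shape[1] - p + 1)
        patches.append(img[r:r + p, c:c + p].reshape(-1))
    Z = np.stack(patches, axis=1)               # p^2 x n_patches patch matrix
    Z = Z - Z.mean(axis=1, keepdims=True)       # zero-mean each pixel row
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    return U[:, :k]                             # W_i: first k filters
```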

Table 1 D-CAE Parameters for all MNIST variant-datasets
Fig. 7 CM & ROC for MNIST small subset (basic) with 70:30 training and testing data

Fig. 8 CM & ROC for MNIST small subset (basic) with 50:50 training and testing data

In these experiments, the core D-CAE model uses a softmax classifier to determine the overall classification behaviour of D-CAE based image classification. Figures 7a, b and 8a, b show the confusion matrices and ROC curves for the 70:30 and 50:50 training-testing ratios, respectively. The confusion matrices make clear that the proposed model gives better results for the 70:30 ratio than for the 50:50 ratio on the MNIST basic dataset. The highest accuracy is observed for class 1, followed by class 6 and class 0, because their features are more distinct than those of the other classes. On the other hand, the lowest accuracy is seen for class 8, followed by class 9 and class 5, as these classes were misclassified as one another because of the high similarity of their features.
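A sketch of how such confusion matrices and per-class ROC curves could be produced is shown below; the use of scikit-learn and the one-vs-rest treatment of the 10 digit classes are assumptions, not details stated in the paper.

```python
from sklearn.metrics import confusion_matrix, roc_curve, auc
from sklearn.preprocessing import label_binarize

def evaluate(y_true, y_pred, y_score, n_classes=10):
    """y_score: softmax class probabilities, shape (n_samples, n_classes)."""
    cm = confusion_matrix(y_true, y_pred)            # 10 x 10 confusion matrix
    y_bin = label_binarize(y_true, classes=list(range(n_classes)))
    rocs = {}
    for c in range(n_classes):                       # one-vs-rest ROC per digit
        fpr, tpr, _ = roc_curve(y_bin[:, c], y_score[:, c])
        rocs[c] = (fpr, tpr, auc(fpr, tpr))
    return cm, rocs
```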

Fig. 9 CM & ROC for MNIST random rotation digits (rot) with 70:30 training and testing data

Fig. 10 CM & ROC for MNIST random rotation digits (rot) with 50:50 training and testing data

From Figs. 9a and 10a we can also observe that the proposed model gives better results for the 70:30 training-testing ratio on the MNIST random rotation digits (rot) dataset. The highest accuracy is obtained for class 1 and the lowest for class 5. The ROC curves in Figs. 9b and 10b reflect the same results for this MNIST variant dataset.

Fig. 11 CM & ROC for MNIST random noise background digits (bg-rand) with 70:30 training and testing data

Fig. 12 CM & ROC for MNIST random noise background digits (bg-rand) with 50:50 training and testing data

Fig. 13 CM & ROC for MNIST random background digits (bg-img) with 70:30 training and testing data

Fig. 14 CM & ROC for MNIST random background digits (bg-img) with 50:50 training and testing data

Fig. 15 CM & ROC for MNIST rotation and image background digits (bg-img-rot) with 70:30 training and testing data

Fig. 16 CM & ROC for MNIST rotation and image background digits (bg-img-rot) with 50:50 training and testing data

Figures 11a, b and 12a, b show that the accuracy of D-CAE for the random noise background digits (bg-rand) dataset is also approximately above 90%, except for class 1 and class 9. Similarly, Figs. 13a, b and 14a, b present the experimental results for MNIST random background digits (bg-img) with the 70:30 and 50:50 training-testing ratios. The results are good for all classes except class 1, which falls below 90%.

Figure 15a, b also shows good results, though with some variation in accuracy. The second phase of experiments splits the dataset into 50% training and 50% testing data; the results for MNIST rotation and image background digits (bg-img-rot) are shown in Figs. 15a, b and 16a, b. Overall, the experimental results show a gradual decrease in accuracy from the MNIST small basic dataset to the most complex MNIST rotation and image background digits dataset; the gradual increase in dataset complexity is thus accompanied by decreased accuracy.

5 Comparative analysis

In this section, we evaluate the feature learning capability of the models under two conditions: (a) minimal time complexity and (b) better accuracy. To verify and validate the D-CAE model, we performed several experiments. All experiments were carried out on an Intel Core i7 CPU with 8 GB of RAM running the Windows 10 operating system. The language used for developing and testing these algorithms is Python 3.6. For rapid development of the D-CAE model, we used Keras [29], a Python library for deep learning built on the fast, high-performance numerical computation library TensorFlow [30]. TensorFlow allows easy implementation with both CPU and GPU support. The benchmark variant datasets of MNIST are used for evaluation. MNIST is a dataset of 70,000 handwritten digit images of the digits 0 to 9; each image has a size of 28 \(\times \) 28 pixels. We split our experiments into two phases based on the training and testing data ratio for each dataset. In the first phase we split the data into 70% training and 30% testing, that is, 49,000 training and 21,000 testing images. In the second phase, 35,000 images are used for training and 35,000 for testing. We also used 5000 images from each MNIST variant dataset for validation, as discussed in Sect. 4.
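As a minimal illustration of how a three-encoder stack followed by a softmax layer can be assembled in Keras, consider the sketch below. The layer widths, activations, optimizer and loss are assumptions; it does not reproduce the paper's training procedure (which avoids backpropagation) or the contractive penalty of Eq. 8.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_classifier(input_dim=784, hidden=(400, 255, 128), n_classes=10):
    # Three dense encoding layers followed by a softmax classification layer.
    inputs = keras.Input(shape=(input_dim,))
    x = inputs
    for h in hidden:                       # hypothetical hidden sizes
        x = layers.Dense(h, activation="sigmoid")(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```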

5.1 Running time comparison

Although different algorithms have different computational behaviour, all experiments were carried out on the same hardware and software configuration described in Sect. 5; nevertheless, some variance is evident in the results presented in Table 2, which summarizes the training runtime of different competitive models from the literature on the MNIST variant datasets. Table 2 clearly shows that the training time of the iterative methods, i.e. RBM [31], DAE [32], SAE [33] and SCAE [17], is several times greater than that of the non-iterative methods, i.e. PCANet [34, 35] and ScatNet [36]. Different parameter settings affect the runtime complexity of the aforementioned models. Nevertheless, under the same parameter tuning, the proposed D-CAE outperforms both the non-iterative and the iterative models.

Table 2 Running time comparison of different models on MNIST variant-datasets

5.2 Digit recognition on MNIST variant-datasets

The results in Table 3 show the evaluation of the proposed D-CAE model. We did not use backpropagation in our training, which makes our training faster than that of the other non-iterative models. Moreover, the training time of our model is mostly spent on image representation learning; we used low-level feature representations, which reduces this time complexity, and it can be reduced further by increasing the memory capacity. Because memory consumption is directly proportional to representation learning, extending the memory will increase the representation learning speed. In addition, a principal component analysis step can be applied in parallel after the memory extension to boost the training phase. This property of our model is also key to moving towards big data applications. In the comparative analysis of our model, several state-of-the-art methods are used: some are non-iterative in nature, e.g. PCANet [37, 38] and ScatNet [39], while some are iterative, e.g. DAE [32], RBM [31], SAE [13] and SCAE [17]. Table 3 shows that the proposed D-CAE model performs increasingly better as the dataset becomes more complex and larger; as can be seen for the last and second-to-last datasets, our ranking is highest where the MNIST variant datasets are most complicated. The significance of our model is not clearly demonstrated on the basic MNIST subset, because the latter is standardized, whereas the remaining MNIST variant datasets are more complex due to rotations, background randomization and the presence of background images. Hence, the experimental results in Table 3 demonstrate the significance of our model in a complex data environment.

Table 3 Training accuracy of different models on MNIST variant-datasets

5.3 Classification based comparison

In addition to analysing the proposed model in terms of training error rate, some experiments were conducted to analyse the testing error rate of the different models using a softmax classifier in the last layer. We trained and tested all of the models with the same architecture for the pre-classification steps. Table 4 lists all models with their classification error rates; each result in Table 4 is the mean value over 5 repeated experiments. The results show that D-CAE with a softmax classifier outperforms ScatNet, PCANet, RBM, DAE and SAE with a softmax classifier.

Table 4 Classification error rates of different models on MNIST variant-datasets

In order to keep the inner-class variance of the extracted features from being 0, the results of the ScatNet model are generated regardless of the threshold value; in the cases where the variance of a class feature did become 0, we applied a normal fit using a normal distribution for that class. Table 4 shows some interesting results for PCANet: the most widely used non-iterative model [40] did not show effective results here. However, it outperforms the well-known RBM-based classifier, which usually does not perform well when the training data size increases [41]. The remaining models performed similarly to their variants at roughly the same accuracy rate. The multi-dimensionality of the extracted features is decisive for some of the non-iterative methods, but classifiers based on models such as SAE perform well in a high-dimensional feature space.

Among the classifiers in our experiments, the softmax classifier performed well. The softmax regression classifier is based on an iterative algorithm; for this reason it is solved iteratively using the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) [42] algorithm. We set the iteration count to 500 and the weight decay coefficient \(\gamma \) to \(10^{-4}\). These parameter settings follow those reported by several researchers in the literature [43, 44]. Owing to its feature dimensionality reduction property, simple architecture and ease of implementation, the proposed model with a softmax classifier outperformed the state-of-the-art models listed in Table 4.
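One way to fit such a softmax layer with L-BFGS and an L2 weight decay term is sketched below; scikit-learn is assumed here, and the mapping from the weight decay coefficient \(\gamma = 10^{-4}\) to the inverse regularization strength C is only approximate.

```python
from sklearn.linear_model import LogisticRegression

def fit_softmax(codes, labels, gamma=1e-4, max_iter=500):
    # codes: D-CAE features, labels: digit classes 0-9.
    # C is the inverse regularization strength; C = 1/(n_samples * gamma)
    # is an approximate translation of the weight decay term.
    clf = LogisticRegression(solver="lbfgs",
                             C=1.0 / (len(codes) * gamma),
                             max_iter=max_iter)
    return clf.fit(codes, labels)
```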

Table 5 Precision and recall comparison of the proposed model with state-of-the-art models

In addition to the training and testing accuracies in Tables 3 and 4, Table 5 provides a more explanatory pair of performance evaluation metrics, precision and recall. Precision is the ratio of correctly classified positive observations to all instances classified as positive, and recall is the ratio of correctly classified positive instances to all actual instances of the class. Both precision and recall are generally defined for binary classification; we therefore compute these measures for each class separately and then average over all classes to obtain the final precision and recall values. Table 5 compares the precision and recall of the proposed model with the other state-of-the-art models.
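The per-class averaging described above corresponds to macro-averaged precision and recall, sketched below; the use of scikit-learn for the metric computation is an assumption.

```python
from sklearn.metrics import precision_score, recall_score

def macro_precision_recall(y_true, y_pred):
    # Compute precision/recall per class, then average over all classes.
    precision = precision_score(y_true, y_pred, average="macro")
    recall = recall_score(y_true, y_pred, average="macro")
    return precision, recall
```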

6 Conclusion

This paper proposed D-CAE, a fast and simple architecture for feature reduction and abstract representation learning in image classification. The work presented in this manuscript makes three main contributions: it uses CAEs to learn high-level features without a backpropagation scheme; a softmax classifier is used as the last layer of the trained D-CAE for classification; and comparative experiments were performed on the MNIST variant datasets, as described in Sect. 5, to demonstrate the significance of the model. The experiments show that our model produces effective results compared with state-of-the-art classification algorithms on relatively complex datasets. The simplicity of the proposed D-CAE architecture provides a solid base for implementation and experimentation. As shown in Sect. 4, as the data becomes more complex, D-CAE outperforms the comparative models in feature learning, classification and time complexity. In conclusion, D-CAE performed better on the complex MNIST variant datasets. Therefore, our future research will focus on the evaluation of more complex datasets, such as the Toronto Face Dataset (TFD) and the Canadian Institute For Advanced Research (CIFAR) dataset.