1 Introduction

Content based image classification (CBIC) has emerged as an important research theme owing to the escalating use of image data in assorted domains including medicine, entertainment, education and defence [1]. Radical improvement in accuracy has been observed in object recognition systems with the advent of various low level features [2]. Nevertheless, the efficiency of these systems becomes questionable for large image datasets because of high computational overhead and elevated storage requirements. One of the reasons for these issues is the dimension of the feature vectors extracted to represent an image. The hand-crafted feature extraction techniques in contemporary literature extract image features by processing the entire image surface [3]. However, the whole image may not be necessary to create a distinct feature vector that effectively distinguishes the image categories using fine-grained features [4]. The literature suggests the effectiveness of image blocks for local feature extraction. Significant performance enhancement in local invariant face recognition has been observed with adaptive selection of image blocks [5]. Hence, it has become imperative to locate the region of interest (ROI) for extracting useful features to facilitate effective CBIC. Image data is growing on a daily basis, and in view of this growth it is essential for image-based real time applications to reduce memory requirements and computational expense. The authors have attempted to use a partial image for feature extraction by including only those image blocks that contribute significantly to credible features. Insignificant image blocks are removed before feature extraction, which considerably reduces the computational overhead of feature extraction and also contributes to feature dimension reduction.
A recent work [6] has addressed the issue by modelling a technique of error histogram analysis of image partitions via an m/p/m autoencoder. The autoencoder has been chosen to be a shallow one (where p < m) in order to identify relevant image partitions exhibiting a higher amount of reconstruction error. This facilitates the identification of insignificant partitions which have little involvement in the image identification process. The technique first extracts features from all partitions of a segmented image and then identifies the features extracted from insignificant image segments, which will not contribute considerably to the image recognition process [6]. Features of such insignificant segments are discarded and the remaining features are used for retrieval of medical images. However, to the best of our knowledge, removal of insignificant partitions before feature extraction by calculating reconstruction error using sparse autoencoders is first introduced in this paper and is not carried out in the aforesaid work. Also, as opposed to [6], where the intra-class image retrieval problem is addressed, we introduce for the first time a sparse autoencoder based framework for feature vector dimensionality reduction in the application of CBIC. We observe that the classification accuracy with features extracted from the entire image without segmentation is much lower than with the novel approach of significant partition selection for feature extraction. The dimension of the feature vector is also reduced, as it is extracted only from chosen partitions and not from the entire image. It is observed that selection of partitions based on reconstruction errors using sparse autoencoders contributes significantly to enhancing classification accuracy.
Gradual lessening of the number of partitions for feature extraction with reduced feature dimension has attenuated the classification accuracy, but the level of accuracy still remains higher compared to classification accuracy with features extracted from images with no partitions. In a similar fashion, reduction in dimension of fused features for partitioned images has outperformed the classification accuracy of fused features for their counterparts with no partitions. Thus, the soundness of the proposed approach lies in the consistency in classification accuracy even after ensuring decreased feature dimension. The scientific rigor of the approach has been established by testing the proposed technique with two public datasets namely, Wang dataset and Corel 5K. The proposed approach has exhibited similar behaviour when scaled up with 5 times increased number of images in Corel 5K compared to Wang dataset. Two different feature extraction techniques, i.e. Uniform Local Binary Patterns (ULBP) [7] and colour histogram [8] of RGB image are used to extract features from relevant partitions as well as randomly selected partitions of segmented images each of which leads to different classification accuracies. Due to the reduction in number of partitions for which features are extracted, it becomes easier to fuse more relevant features such as colour histogram with minmax normalization [9] and create a fusion based feature input to the classifiers. As expected, the fusion based approach has revealed superior classification accuracy compared to any of the individual approaches. The proposed technique has also surpassed classification accuracy recorded for two state-of-the-art techniques of feature extraction, viz., Histogram of Oriented Gradients (HOG) [10] and Scale-invariant feature transform (SIFT) [11]. This has established the partition based method to be of general interest to increase the classification accuracy. 
Two different datasets, namely, Wang Dataset (1000 images with 10 categories) and Corel 5K Dataset (5000 images with 50 categories) are used for evaluation purpose. Thus, on the whole, 6000 images are considered for the experimental work.

The objectives of this work can be summarized as follows:

  • Enabling the use of sparse autoencoders to identify significant image partitions for dimension reduction of hand-crafted features.

  • Formation of fusion based feature vector with features extracted using two different techniques.

  • Using Support Vector Machine (SVM) and Extreme Learning Machine (ELM) for content based image classification with reduced features as well as fused features.

  • Comparison of classification performance of proposed technique with reduced and fused feature sets.

  • Comparison of the proposed approach with state-of-the-art techniques.

This paper is organized as follows. The background and motivation of the research have been discussed in this Introduction. A survey of the contemporary literature is presented in Sect. 2. The proposed framework for feature extraction is explained in Sect. 3. Classification results are analysed and compared in Sect. 4. Finally, the research outcomes are discussed in Sect. 5.

2 Related work

CBIC has embraced widespread literature that includes varied approaches over last 20 years [12]. Present work has introduced a novel approach of selecting the ROI in an image for feature extraction using sparse autoencoder. Therefore, the paper is in correlation with visual feature extraction from image datasets for CBIC and also with the role of autoencoders to assist content based feature extraction. The work also facilitates fusion methodologies as it demonstrates the usefulness of a feature ensemble technique for increased classification accuracy along with dimensionality constraints. The following subsections contain survey of some of the relevant works carried out in the aforesaid areas.

2.1 Classification technique

Supervised classification uses the feature vectors extracted from images as input data. Researchers have developed different techniques based on Artificial Intelligence (logical/symbolic techniques), perceptrons and statistics (Bayesian networks, instance-based techniques) [13]. The logic based algorithms comprise decision trees and rule based classifiers [14]. Decision trees sort the feature values of the instances for classification. Rule based classifiers represent each class by a disjunctive normal form (DNF) of classification rules. Perceptron-based techniques use either single layered or multi-layered perceptrons. A single layered perceptron applies connection weights to the input feature vector, and the weighted sum is compared against an adjustable threshold to produce the output prediction. Perceptron-like methods are binary, so a multiclass problem is converted into a set of binary classification problems [15]. A multi-layered perceptron is a neural network formed by joining a large number of units (neurons) in a pattern of connections. It comprises three kinds of units: input units, which receive the information to be processed; output units, which deliver the results of processing; and hidden units, which lie between the input and output units. A feed-forward network permits one-way movement of signals, from input to output. A learning algorithm named Extreme Learning Machine (ELM) has been used extensively for generalized single hidden layer feed-forward neural networks [16]. The distinguishing property of the ELM learning architecture is the random generation of the hidden node parameters and the analytical computation of the output weights. ELM has further been extended to kernel learning, which has shown its suitability for a wide variety of classification tasks.
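The analytical computation of the ELM output weights can be made concrete with the standard formulation from the general ELM literature. The symbols below are the usual textbook notation and are not taken from this paper: \(\mathbf{H}\) is the hidden layer output matrix obtained with the randomly generated hidden node parameters, \(\mathbf{T}\) is the matrix of training targets, \(\mathbf{H}^{\dagger}\) is the Moore–Penrose generalized inverse of \(\mathbf{H}\) and \(\boldsymbol{\beta}\) denotes the output weights.

$$\boldsymbol{\beta} = \mathbf{H}^{\dagger}\mathbf{T}$$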
The statistical models are based on an explicit underlying probabilistic approach which provides the probability that an instance belongs to each class, rather than merely a class label. Another contemporary classification technique is the Support Vector Machine (SVM), which introduces the notion of a “margin” on either side of a hyperplane that separates two classes [17]. An optimum separating hyperplane can be derived by minimizing the squared norm of the hyperplane's weight vector for linear separation of two classes.
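For reference, the minimization mentioned above is usually written as the hard-margin problem below. This is the standard textbook form, stated under the assumption of linearly separable classes; the symbols \(\mathbf{w}\) (weight vector), \(b\) (bias), \(\mathbf{x}_{i}\) (feature vector) and \(y_{i} \in \{-1, +1\}\) (class label) are conventional notation and do not appear in the paper.

$$\min_{\mathbf{w}, b} \; \frac{1}{2}\left\| \mathbf{w} \right\|^{2} \quad \text{subject to} \quad y_{i}\left( \mathbf{w}^{\top}\mathbf{x}_{i} + b \right) \ge 1 \;\; \text{for every training sample } i$$

Under this formulation the margin width equals \(2/\left\| \mathbf{w} \right\|\), so minimizing the squared norm maximizes the margin.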

2.2 Feature extraction

Efficient feature extraction is a fundamental necessity for the success of CBIC. A high level representation of image data can be achieved by extracting features from low-level data (pixels). Fundamentally, visual features are arranged in three hierarchies, namely, low-level features (primitive), middle-level features (logical) and high-level features (abstract) [18]. The low-level features comprise texture, colour, shape etc., which have been considered by most of the early systems for image classification. Recently, a surge of activity has been witnessed in the development of mid-level features involving sub-images, the bag-of-words approach etc., and of high-level image features comprising semantics. The features must express the image content correctly for a proper classification outcome; if they fail to do so, the images can hardly be categorized.

Colour is considered to be one of the essential and meaningful features widely used in CBIC and object recognition [19]. A number of descriptors have been designed for extraction of colour features. The occurrence of colours in an image can be well expressed by a colour histogram [20]. Significant colour distribution in the ROI can be described by means of the dominant colour descriptor [19]. The probability of locating a pair of colours at a specified distance is depicted by the colour correlogram [21, 22]. Colour moments have revealed better image identification compared to conventional methods [23]. Global features are popularly extracted by building a grey level histogram consisting of 256 bins. However, the main shortcomings of the histogram based technique are the probable allocation of similar colour intensities to different bins and the lack of any spatial information. Local histograms have also been computed by splitting the images into partitions to overcome these issues. Another important image feature is texture, which has a vital role in feature extraction for CBIC. Texture feature extraction techniques are of statistical, geometrical or model based nature. The statistical way of representing textures is characterized by the grey-level co-occurrence matrix (GLCM), Markov random field, Gabor wavelet, edge histogram descriptor (EHD) etc. [24, 25]. A practical approach to representing grey level textures by means of patterns is Local Binary Patterns (LBP) [26]. The low computational overhead of LBP along with its invariance to resolution changes has enabled it to become the state-of-the-art texture descriptor.
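The pattern idea behind LBP can be illustrated with a minimal, dependency-free sketch. It assumes the common 8-neighbour, radius-1 variant and the usual definition of a uniform pattern (at most two 0/1 transitions in the circular bit string); the function names `lbp_code` and `is_uniform` are hypothetical, and the neighbour ordering is one of several equivalent conventions. The count it produces is consistent with the 59-dimensional ULBP descriptor used in Sect. 3.

```python
def lbp_code(patch):
    """Return the 8-bit LBP code of the centre pixel of a 3x3 grey-level patch."""
    centre = patch[1][1]
    # Neighbours visited clockwise, starting at the top-left pixel (assumed order).
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (r, c) in enumerate(offsets):
        if patch[r][c] >= centre:  # neighbour at least as bright as the centre
            code |= 1 << bit
    return code


def is_uniform(code, bits=8):
    """A pattern is uniform if its circular bit string has at most two transitions."""
    transitions = 0
    for i in range(bits):
        current = (code >> i) & 1
        following = (code >> ((i + 1) % bits)) & 1
        transitions += current != following
    return transitions <= 2


uniform_codes = [code for code in range(256) if is_uniform(code)]
# One histogram bin per uniform pattern plus one shared bin for all other patterns.
histogram_length = len(uniform_codes) + 1
```

Grouping all non-uniform codes into a single bin is what shortens the histogram from 256 entries to `histogram_length` entries per region.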

2.3 Autoencoders

Encoding and decoding of inputs can be carried out with minimum error by means of a special type of neural network known as an autoencoder [27]. An autoencoder is trained to encode an input into a representation from which the same input can be reconstructed [28]. Once an autoencoder has been trained successfully, the inputs are mapped onto a pattern of hidden layer activations. This process is termed feature learning for the autoencoder. The dimension of the learned feature is equal to the number of neurons in the hidden layer. A sparse autoencoder adds a penalty to the error function which ensures that each hidden neuron is activated in response to only certain types of input, not all of them.
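The paper does not state which sparsity penalty is used. A minimal sketch of one common choice, the Kullback–Leibler divergence between a target activation level and the mean activation of each hidden neuron, is given below. The function names and the parameters `target` and `weight` are hypothetical and serve only to illustrate how a penalty is added to the reconstruction error.

```python
import math


def kl_sparsity_penalty(mean_activations, target=0.05):
    """Sum of KL divergences between the target and each neuron's mean activation.

    Every value in mean_activations must lie strictly between 0 and 1.
    """
    total = 0.0
    for rho_hat in mean_activations:
        total += target * math.log(target / rho_hat)
        total += (1 - target) * math.log((1 - target) / (1 - rho_hat))
    return total


def sparse_loss(reconstruction_error, mean_activations, target=0.05, weight=1.0):
    """Reconstruction error plus the weighted sparsity penalty (assumed form)."""
    return reconstruction_error + weight * kl_sparsity_penalty(mean_activations, target)
```

The penalty is zero when every hidden neuron fires at the target rate and grows as neurons become active for most inputs, which pushes each neuron to respond to only a subset of them.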

2.4 Fusion techniques

Feature fusion enables the ensemble of diverse features extracted by means of different feature extraction techniques. Fusion of complementary features enhances the classification rate and falls into four categories: early fusion, late fusion, hybrid fusion and intermediate fusion. Early fusion combines the features of different techniques before learning and thus provides the classifier with a single input. Late fusion combines the classification decisions obtained from different classifiers, one for each feature extraction technique. Hybrid fusion combines the two aforesaid techniques. Intermediate fusion integrates multiple features by considering a joint model for decision to yield superior prediction accuracy [29]. Several techniques have been proposed for fusion of various complementary hand-crafted features [30,31,32,33,34,35,36,37,38,39,40,41].

Following observations are made based on the survey of the contemporary literature:

  • None of the techniques except in [5] have considered feature extraction from a ROI in an image.

  • Although improvement in classification accuracy is observed with fusion based techniques, the increase in feature dimension due to fusion has not been dealt with.

  • Large feature size due to the above two limitations has contributed to increased computational overhead for classification.

The authors have readily addressed the observations made in the literature review and have designed a method to handle the aforesaid issues. The novel method has outperformed the traditional approaches and has revealed improved outcomes.

3 Proposed framework

We introduce a threshold based partition selection technique by means of a shallow autoencoder [42]. We prefer the autoencoder guided approach of selectively removing small image partitions which are not relevant to the feature extraction process and will not contribute significantly to image classification. This helps to reduce the feature dimension. The autoencoder has a generic m/p/m architecture, which encodes m inputs into p hidden units and then reconstructs m outputs from those p units. A shallow network here signifies an autoencoder with p < m, which compresses the input to a reduced dimension.
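The m/p/m data flow can be sketched in a few lines. The example below is untrained and only illustrates the shapes involved; the sigmoid encoder, the linear decoder, the omission of bias terms and the random initialization are all assumptions, and the names `make_autoencoder`, `encode` and `decode` are hypothetical.

```python
import math
import random


def make_autoencoder(m, p, seed=0):
    """Create random weights for an m/p/m autoencoder (biases omitted for brevity)."""
    rng = random.Random(seed)
    w_enc = [[rng.uniform(-0.1, 0.1) for _ in range(m)] for _ in range(p)]
    w_dec = [[rng.uniform(-0.1, 0.1) for _ in range(p)] for _ in range(m)]
    return w_enc, w_dec


def encode(w_enc, x):
    """Compress m inputs into p hidden activations with a sigmoid non-linearity."""
    return [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, x)))) for row in w_enc]


def decode(w_dec, code):
    """Reconstruct m outputs from the p hidden activations with a linear layer."""
    return [sum(w * c for w, c in zip(row, code)) for row in w_dec]
```

With m = 16 and p = 4, `encode` returns 4 values and `decode` returns 16, so the input must pass through a bottleneck before it is reconstructed.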

A recent approach performed fast intra-class medical image retrieval using significant image partitions [6]. Initially, each medical image has been divided into several partitions and in the training phase of classification, LBP features have been extracted from all image partitions and have been stored in feature matrix. Reconstruction error has been computed for all the partitions and the errors have been stored in an error histogram for each class. During testing phase, LBP features have been computed for the testing image’s partitions which are then classified by SVM. Based on their predicted class and the corresponding error recorded in error histogram, the partitions have been either discarded or retained for intra-class image retrieval task. Here, we extract 59 ULBP features to encode image texture. However, in contrast to the existing technique, we eradicate the irrelevant image partitions before feature extraction and thus prohibit their use in the feature extraction stage itself. We never extract features from all image partitions which minimizes the computational overhead of our classification method. Classification is carried out with SVM and ELM both for performance comparison. The reduction in number of partitions for which features are computed, allows for a feature fusion based approach for CBIC. Therefore, colour histogram based features are also extracted from the significant image partitions in addition to ULBP features, and are used in classification. Feature extraction using colour histogram is carried out with RGB images where the dimension of extracted features is dependent on the number of input bins for each colour component. However, the number of bins must be same for all three colour components. The dimension of extracted features is \(nbins^{3}\) where nbins denote the number of input bins and 3 is the number of colour components. In this work, 4 bins are chosen for each colour component resulting in feature dimension of 64.
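The \(nbins^{3}\) dimension follows from binning the three colour components jointly. A minimal sketch is shown below, assuming 8-bit RGB pixels given as (r, g, b) tuples and equal-width bins; the function name `colour_histogram` and the `levels` parameter are hypothetical.

```python
def colour_histogram(pixels, nbins=4, levels=256):
    """Joint RGB histogram with nbins per channel, giving nbins**3 features."""
    hist = [0] * (nbins ** 3)
    for r, g, b in pixels:
        # Map each intensity in [0, levels) to a bin index in [0, nbins).
        r_bin = r * nbins // levels
        g_bin = g * nbins // levels
        b_bin = b * nbins // levels
        hist[r_bin * nbins * nbins + g_bin * nbins + b_bin] += 1
    return hist
```

With the default of 4 bins per component the returned list has 4 × 4 × 4 = 64 entries, which matches the feature dimension stated above.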

Each image in a dataset is divided into n × n partitions of equal dimensions. We use a shallow m/p/m autoencoder to locate the image partitions that exhibit high decoding error. The partitions are selected by comparing the reconstruction errors of the autoencoded partitions. Thus, the error is useful for locating the partitions relevant for feature extraction in a particular image. Image partitions with complex content such as edges will have a higher decoding error than partitions containing smooth regions with no significant gradient change. Partitions with the lowest decoding error contribute least to the classification results.
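The partitioning step can be sketched as follows, assuming a square grey-level image stored as a list of rows whose side is divisible by n. The function name `split_into_partitions` and the row-major ordering of the blocks are assumptions made for illustration.

```python
def split_into_partitions(image, n):
    """Split a square image (list of rows) into n*n equal blocks in row-major order."""
    side = len(image)
    if side % n != 0 or any(len(row) != side for row in image):
        raise ValueError("image must be square with a side divisible by n")
    block = side // n
    partitions = []
    for block_row in range(n):
        rows = image[block_row * block:(block_row + 1) * block]
        for block_col in range(n):
            partitions.append([row[block_col * block:(block_col + 1) * block] for row in rows])
    return partitions
```

Calling the function with n = 4 yields the 16 partitions used in the experiments; each partition would then be flattened and passed to the autoencoder.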

As mentioned earlier, the n × n partitions of an image I are analysed for ROI by means of the autoencoder to locate the partitions for feature extraction. A threshold is therefore set to reduce the number of image partitions used for feature extraction and subsequently for classification. The threshold \(t\), chosen such that \(t \in [0,1)\), is a fraction of the number of partitions of a particular image. Therefore, the approach removes as many as \([t \times n \times n]\) partitions which are not useful for feature extraction. The partitions below an error threshold are removed and the rest are retained for feature extraction.
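A simplified sketch of error based selection is given below. It ranks all partitions of an image by reconstruction error and keeps the highest ones, which is an assumption: the paper compares partitions within sets of consecutive partitions (Sect. 4.2), and that grouping is not reproduced here. The parameter `keep_fraction` is hypothetical and denotes the share of partitions retained, in line with the experimental settings in Sect. 4, where a threshold of 0.5 on 16 partitions leaves 8.

```python
def select_partitions(errors, keep_fraction):
    """Return the indices of the partitions with the highest reconstruction error."""
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must lie in (0, 1]")
    keep = max(1, round(keep_fraction * len(errors)))
    # Rank partition indices from highest to lowest error.
    ranked = sorted(range(len(errors)), key=lambda i: errors[i], reverse=True)
    # Report the retained indices in their original spatial order.
    return sorted(ranked[:keep])
```

Features would then be extracted only from the returned partitions, so the discarded ones never reach the feature extraction stage.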

The step-by-step process for the proposed framework is given as Algorithm 1 in Fig. 1. The partitions of an image are shown in Fig. 2.

Fig. 1 Proposed framework

Fig. 2 Partition selection for feature extraction with reconstruction error of autoencoder

Further, we propose the use of fused features which is more feasible now with the limited number of partitions being used in classification. We have designed the fusion of features extracted from different number of selected partitions using autoencoder’s error as shown in Figs. 1 and 2. The process of fusion is depicted in Fig. 3.

Fig. 3 Fusion of features for classification

4 Experimental results and analysis

The experiments are conducted with Neural Network Toolbox of MATLAB R2015b on a machine with Intel core i5 processor and 4 GB RAM. We have used two different datasets which are briefly described in the following subsections. Further, the reconstruction error generated by the autoencoder is discussed. Thereafter, we discuss the results obtained from various classification techniques. Finally, we report and compare the results of classification obtained from diverse feature extraction techniques.

4.1 Datasets used

4.1.1 Wang dataset

Wang et al. have provided this widely used public dataset of 1000 images [40]. Every image is 256 × 384 or 384 × 256 pixels, and the images are divided into 10 categories with 100 images each: Tribals, Sea Beaches, Gothic Structures, Buses, Dinosaur, Elephants, Flowers, Horses, Mountains and Food. A sample collage for Wang’s dataset is given in Fig. 4.

Fig. 4 Three sample images of each class of Wang’s dataset

4.1.2 Corel 5K dataset

Corel 5K is another benchmark dataset in computer vision experiments which has 50 different categories of images with dimensions: 128 × 192 or 192 × 128 pixels [43]. Some of the image categories are beer, wolf, lion, elephant, tiger, mountains, vegetables, faces etc. A sample collage of the Corel dataset is given in Fig. 5. For the sake of simplicity, the images of both datasets are resized to square sizes.

Fig. 5 Sample collage for Corel 5K dataset

4.2 Error measurement for partition selection

The shallow neural network architecture of an autoencoder produces some level of error when the values provided to the input neurons are compressed by the neurons in the hidden layer and then reconstructed into the same number of output values. The grey values of each partition created by dividing an image into n × n partitions are fed as input to the autoencoder. The numbers of input and output neurons are equal to the number of grey values m, and the number of neurons in the hidden layer is smaller, as p < m. The value of p is chosen to be 10 in our work, which is much less than m. For example, m is 4096 when a 256 × 256 Wang dataset image is divided into 16 equal partitions, and 1024 when a 128 × 128 Corel 5K dataset image is divided into 16 equal partitions. The reconstructed values produced from the compressed hidden layer are available at the output layer, and the reconstruction error is measured as the mean squared error (MSE) given in Eq. (1).

$${\text{MSE}} = \frac{1}{m}\sum\limits_{y = 1}^{\sqrt{m}} \sum\limits_{x = 1}^{\sqrt{m}} \left[ I(x,y) - I'(x,y) \right]^{2}$$
(1)

where I(x,y) and I′(x,y) are the autoencoder’s input and output grey values of a partition. A flowchart of the feature extraction process is given in Fig. 6.

Fig. 6 Flowchart for feature extraction process
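Eq. (1) and the values of m quoted above can be checked with a short sketch. It assumes that a partition and its reconstruction are given as equally sized lists of rows of grey values; the function names `inputs_per_partition` and `reconstruction_mse` are hypothetical.

```python
def inputs_per_partition(side, n):
    """Number of grey values m in one partition of a side x side image split n x n."""
    return (side // n) ** 2


def reconstruction_mse(original, reconstructed):
    """Mean squared error between a partition and its autoencoder output, as in Eq. (1)."""
    total = 0.0
    count = 0
    for row_in, row_out in zip(original, reconstructed):
        for value_in, value_out in zip(row_in, row_out):
            total += (value_in - value_out) ** 2
            count += 1
    return total / count
```

For 16 partitions (n = 4), `inputs_per_partition(256, 4)` gives 4096 and `inputs_per_partition(128, 4)` gives 1024, in agreement with the values of m stated for the two datasets.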

The threshold of partition selection is initialized at the beginning and based on the higher MSE values, the partitions are chosen from a set of consecutive partitions. The partitions with lower MSE values within a set are discarded and not considered for feature extraction.

4.3 Classification techniques

Two different classifiers, namely, SVM and ELM, have been used for CBIC with the novel approach of feature extraction proposed in this work. The classification is performed using tenfold cross validation and leave-one-out validation. For tenfold cross validation, the entire dataset is divided into 10 subsets; for leave-one-out, it is divided into N subsets, where N is the number of images in the dataset. The training set for tenfold cross validation comprises 9 subsets and the remaining subset is used as the testing set. In leave-one-out cross validation, the training set comprises N − 1 subsets and the testing set is the remaining subset. The procedure is repeated for 10 trials in tenfold cross validation and for N trials in leave-one-out. In each trial, the testing and training subsets are changed in round robin fashion. The classification performance is measured by the average classification accuracy over the 10 trials for tenfold cross validation and over the N trials for leave-one-out validation. The classification accuracy is calculated by averaging the ratio of the number of correctly classified images \(n_{c}\) to the total number of images in the dataset \(N\), as in Eq. (2).

$${\text{Accuracy}} = \frac{1}{N}\sum\limits_{k} {\frac{{n_{c} }}{N}}$$
(2)

where, k is the number of trials, which is 10 in case of tenfold cross validation and N in case of leave-one-out cross validation.
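The validation protocol can be sketched as below. Two points are assumptions rather than details confirmed by the paper: samples are assigned to folds by interleaving their indices, and each trial's accuracy is computed over the images in its own testing subset before the k values are averaged. The names `fold_indices`, `cross_validated_accuracy` and the callback `train_and_predict` are hypothetical; the callback stands in for training SVM or ELM on the training indices and predicting labels for the testing indices.

```python
def fold_indices(num_samples, k):
    """Assign sample indices to k folds by interleaving (assumed scheme)."""
    return [list(range(start, num_samples, k)) for start in range(k)]


def cross_validated_accuracy(labels, k, train_and_predict):
    """Average the per-trial accuracy over k trials, each holding out one fold."""
    accuracies = []
    for test_idx in fold_indices(len(labels), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        predictions = train_and_predict(train_idx, test_idx)
        correct = sum(pred == labels[i] for pred, i in zip(predictions, test_idx))
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k
```

Setting k to 10 gives tenfold cross validation, and setting k to the number of images gives leave-one-out validation, where every testing subset holds a single image.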

4.4 Classification results

4.4.1 Results on Wang dataset

For the sake of simplicity, the images are resized using bi-cubic interpolation to 256 × 256, 128 × 128, 64 × 64 and 32 × 32 pixels. We divide the images into 16 partitions in each of the cases considered. Thus, each of the 16 partitions is of 64 × 64, 32 × 32, 16 × 16 and 8 × 8 pixels respectively. Initially, tenfold cross validation is carried out to measure the classification accuracy as depicted in Tables 1, 2, 3 and 4. For the purpose of comparison, classification is first carried out with ULBP features extracted from entire image without dividing it into partitions. Further, the classification is carried out by dividing the image into 16 partitions and gradually reducing the partitions for feature extraction based on the proposed framework. Further, we also compare the results with random selection of partitions. SVM classifier with Radial Basis Function (RBF) kernel is used here for the classification purpose.

Table 1 Tenfold cross validation accuracy of SVM (RBF kernel) with ULBP on 256 × 256 pixels images of Wang dataset
Table 2 Tenfold cross validation accuracy of SVM (RBF kernel) with ULBP on 128 × 128 pixels images of Wang dataset
Table 3 Tenfold cross validation accuracy of SVM (RBF kernel) with ULBP on 64 × 64 pixels images of Wang Dataset
Table 4 Tenfold cross validation accuracy of SVM (RBF kernel) with ULBP on 32 × 32 pixels images of Wang dataset

Results in Table 1 can be used to compare the tenfold cross validation classification accuracy obtained with feature vectors extracted from images without segmentation and with segmentation into 16 partitions. The classification accuracy with ULBP features extracted globally from images without segmentation is observed to be 53.3% for Wang dataset. Further, the images are segmented into 16 partitions and evaluated by extracting features from each of the partitions locally using ULBP, i.e. 100% feature dimension, which results in a classification accuracy of 80.2% using SVM classifier with RBF kernel. The same setup for classification is maintained and feature dimension reduction is carried out by selecting partition threshold \(t \in [0,1)\). Each of the 16 partitions is autoencoded and MSE is computed. Based on the threshold, nt consecutive partitions are compared, and the partitions that exhibit higher MSE values are retained. Thus, feature dimension reduction is carried out by lowering the threshold limit ranging from 0.5 to 0.125 of total number of partitions created for each image. The classification accuracies obtained for features reduced with different thresholds 0.5, 0.25 and 0.125 are 73.4, 62.6 and 52.9% respectively.

The results discussed above reveal that the classification accuracy for segmentation based feature extraction using ULBP is higher than that of feature extraction from the whole image without segmentation, both when all 16 segments are used and for threshold based segment selection at 0.5 and 0.25.

Classification accuracy drops by 6.8 percentage points when the ULBP features are reduced from 100% of the image segments (16 partitions) to 50% (8 partitions), by a further 10.8 points when they are reduced from 50 to 25% (4 partitions), and by another 9.7 points when they are reduced from 25 to 12.5% (2 partitions).

Finally, ULBP features have been extracted from image partitions selected randomly from the 16 partitions of each image. The threshold range has been maintained as \(t \in [0,1)\). The accuracy values for 50, 25 and 12.5% of feature dimension generated from randomly selected blocks are 58, 50.4 and 40.8% respectively. All the accuracy values are lower than the corresponding accuracy values obtained for features extracted from autoencoded blocks.

Different image dimensions 128 × 128, 64 × 64 and 32 × 32 pixels, have been considered further for the experiments to demonstrate consistency of the observations made in Table 1. Each of the different image dimensions have been divided into 16 equal partitions and SVM classifier with RBF kernel is used for classification. The results are displayed in Tables 2, 3 and 4.

An analysis of the results presented in Tables 2, 3 and 4 indicates that classification performance exhibits similar characteristics for different image dimensions except 32 × 32 pixels images. In case of 32 × 32 pixels images, the classification accuracy with segmented images is higher only up to 0.5 threshold value compared to that of image without segmentation.

Therefore, the following inferences can be drawn from the above analysis:

  • Local feature extraction from segmented image using ULBP has higher classification accuracy compared to image without segmentation.

  • Radical reduction in feature dimension by means of autoencoded error based partition selection with predefined partition threshold leads to decrease in classification accuracy, which in most cases (always with 50% dimension reduction) is still higher than accuracy observed without segmentation.

  • Feature reduction carried out with random selection of image partition has less accuracy compared to autoencoded error based partition selection as the ROI is not properly identified in case of random selection.

Next, the autoencoded error based partition selection and random partition selection are carried out once again on the dataset images for feature extraction using the colour histogram technique. The image dimension in this case is 256 × 256 pixels and each image is divided into 16 partitions.

Later, the features extracted using colour histogram and ULBP methods are used for classification with the ELM classifier using RBF kernel. Also, the features extracted with two different feature extraction techniques are fused using minmax normalization and provided as input to SVM and ELM classifiers. In case of Wang dataset, leave-one-out validation is also carried out to evaluate classification accuracy. For the Wang dataset, the performance of SVM on individual and fused features is depicted in Table 5.
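A minimal sketch of fusion with minmax normalization is shown below. It assumes that each feature vector is rescaled to [0, 1] on its own and that the rescaled vectors are then concatenated, which corresponds to early fusion; the exact scheme of the paper is given in its Fig. 3 and may differ. The function names `minmax_normalize` and `fuse_features` are hypothetical.

```python
def minmax_normalize(vector):
    """Rescale a feature vector linearly to the range [0, 1]."""
    lowest, highest = min(vector), max(vector)
    if highest == lowest:
        # A constant vector carries no contrast; map it to zeros to avoid division by zero.
        return [0.0 for _ in vector]
    return [(value - lowest) / (highest - lowest) for value in vector]


def fuse_features(ulbp, colour_hist):
    """Concatenate the normalized ULBP and colour histogram features (early fusion)."""
    return minmax_normalize(ulbp) + minmax_normalize(colour_hist)
```

Normalizing before concatenation keeps one descriptor from dominating the other merely because its raw values span a larger range. For a single partition, a 59-dimensional ULBP vector and a 64-dimensional colour histogram would give a 123-dimensional fused vector under this scheme.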

Table 5 Leave-one-out cross validation accuracy of SVM (RBF kernel) with Colour Histogram, ULBP and fused features on 256 × 256 pixels images of Wang dataset

Table 6 depicts the performance of ELM classifier using leave-one-out cross validation with individual and fused features of Wang dataset.

Table 6 Leave-one-out cross validation accuracy of ELM (RBF kernel) with ULBP, colour histogram and fused features on 256 × 256 pixels images of Wang dataset

4.4.2 Results on Corel 5K dataset

We perform tenfold cross validation on the bigger Corel 5K dataset using the individual and fused features. For the sake of simplicity, the images are resized using bi-cubic interpolation to 128 × 128 pixels. The results obtained with SVM and ELM classifiers, both with RBF kernel, are presented in Tables 7 and 8 respectively.

Table 7 Tenfold cross validation accuracy of SVM (RBF kernel) with ULBP, colour histogram and fused features on 128 × 128 pixels images of Corel 5K dataset

Table 8 Tenfold cross validation accuracy of ELM (RBF kernel) with ULBP, colour histogram and fused features on 128 × 128 pixels images of Corel 5K dataset
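
The two validation protocols used in the experiments differ only in the number of folds, which the following sketch makes explicit. It assumes 5000 images for Corel 5K and 1000 images for the Wang dataset, and it splits the indices in order without shuffling or stratification; the actual fold assignment used in the experiments is not specified, and the function names are hypothetical.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k roughly equal folds."""
    indices = list(range(n_samples))
    # Distribute any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def leave_one_out_splits(n_samples):
    """Leave-one-out is k-fold cross validation with k equal to the sample count."""
    return k_fold_splits(n_samples, n_samples)

tenfold = list(k_fold_splits(5000, 10))   # assumed size of Corel 5K
loo = list(leave_one_out_splits(1000))    # assumed size of the Wang dataset
```

Leave-one-out needs as many training runs as there are images, which is a plausible reason for restricting it to the smaller dataset.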

The results in Tables 7 and 8 reinforce the performance pattern of ELM and SVM observed with the Wang dataset in Tables 1 through 6. In general, the ELM classifier outperforms SVM in terms of classification accuracy.

Therefore, from our experiments, we observe that the RBF kernel exhibits superior performance with the ELM classifier in comparison to the SVM classifier. Also, dimension reduction facilitates the fusion of colour histogram and ULBP features for classification.

4.4.3 Comparison with state-of-the-art techniques

Finally, the proposed technique of feature extraction based on partition selection has been compared with two baseline algorithms, viz., Histogram of Oriented Gradients (HOG) [10] and Scale-invariant feature transform (SIFT) [11]. The experiment is carried out with the Wang dataset. Several parameters have been compared: feature dimension, classification accuracy, individual feature extraction and classification times, and the total time consumed for feature extraction and classification. Each of these comparisons is documented in Table 9.

Table 9 Comparison of proposed technique with state-of-the-art techniques for Wang dataset images of 256 × 256 pixels

4.4.3.1 Feature dimension

The comparison of feature dimensions in Table 9 shows that HOG (cell size = 32) has the largest feature dimension among all the feature extraction techniques considered (SIFT, ULBP, colour histogram and fusion). The SIFT feature dimension is much smaller than that of HOG (cell size = 32/64), of ULBP and colour histogram computed for images with 16 and 8 partitions, and of the colour histogram of images with 4 selected partitions. HOG (cell size = 64) has a lower feature dimension than the ULBP, colour histogram and fused features computed for images with 16 partitions. As expected, all the fused features for images with 16 and 8 selected partitions have a larger dimension than the SIFT and HOG (cell size = 64) features. However, the ULBP feature dimension for images with 4 selected partitions, and the dimensions of the ULBP, colour histogram and fused features extracted from 2 selected image partitions, are smaller than those of both state-of-the-art techniques, HOG and SIFT.

4.4.3.2 Classification accuracy

Table 9 also compares the classification accuracy of the proposed and state-of-the-art feature extraction techniques. Two different classifiers, viz., SVM and ELM, have been used for classification. For the proposed technique, 16, 8, 4 and 2 partitions are considered. For 16 and 8 partitions, the ULBP, colour histogram and fused features have higher accuracy than all the state-of-the-art techniques, i.e. HOG (cell size = 32/64) and SIFT, with both classifiers. In the case of 4 partitions, classification with SVM and ELM yields higher accuracies for the ULBP, colour histogram and fused features than for SIFT. However, with SVM, HOG (cell size = 64) features outperform the 4-partition ULBP and colour histogram features, which is not the case with ELM. Nevertheless, the fused feature for 4 partitions (with a dimension of only 492, much smaller than that of HOG) shows greater classification accuracy than both the state-of-the-art techniques.

The ULBP, colour histogram and fused features extracted from only 2 selected image partitions underperform relative to the HOG (cell size = 32/64) features. However, they still outperform SIFT.

The above analysis indicates that, in most of the test cases, the proposed method surpasses the classification accuracies obtained with both state-of-the-art feature extraction techniques, viz., HOG (cell size = 32/64) and SIFT. However, HOG features extracted with cell size = 32 show superior classification results compared to the proposed technique for 4 and 2 partitions. With HOG cell size = 32 the image has 64 partitions, which is four times the maximum of 16 partitions used in the proposed technique of feature extraction. Yet the proposed technique surpasses the classification accuracy of HOG (cell size = 32) on several occasions. Further, when the cell size for HOG feature extraction is increased to 64, the number of image partitions comes down to 16, which equals the maximum number of image partitions used in the proposed technique. It is observed that the proposed technique of feature extraction (i.e. fused features) achieves better classification results than HOG (cell size = 64) when using 16 image partitions, the same number as HOG (cell size = 64), as well as when using 8 and 4 image partitions, which are half and one-fourth of that number respectively.
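
The partition counts quoted for HOG follow directly from the image and cell sizes, as the short check below shows. It assumes square, non-overlapping cells that tile the 256 × 256 image exactly; the function name is hypothetical.

```python
def hog_cell_count(image_side, cell_size):
    """Number of non-overlapping square cells that tile a square image."""
    cells_per_side = image_side // cell_size
    return cells_per_side * cells_per_side

for cell_size in (32, 64):
    # 256 / 32 = 8 cells per side -> 64 cells; 256 / 64 = 4 per side -> 16 cells.
    print(cell_size, hog_cell_count(256, cell_size))
```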

Therefore, the efficacy and robustness of the proposed method have been well established and found to be superior compared to that of the state-of-the-art techniques.

4.4.3.3 Comparison of speed

Table 9 also compares the time consumed for feature extraction, the time consumed for classification and the total time for the different techniques. SIFT has been used to extract features from images without partitions, whereas ULBP and colour histogram have been used to extract features from selected image partitions as per the proposed framework. SIFT is the costliest in terms of feature extraction time, whereas HOG takes the minimum feature extraction time amongst all the individual techniques. It is worth mentioning that, for the case of 16 partitions, no selection is made, so the feature extraction time for ULBP is much less than for 8, 4 and 2 partitions, where the feature extraction time includes partition selection as well. Feature extraction times for ULBP, colour histogram and fused features generally decrease going from 8 to 4 to 2 partitions.

A comparison of the classification time consumed with the two classifiers, viz., SVM and ELM, is also shown in Table 9. Classification has been carried out with features extracted by the proposed approach and by the state-of-the-art techniques. Classification time with both SVM and ELM is higher for HOG features than for the features extracted using SIFT, ULBP and colour histogram. With ELM, SIFT has a classification time less than or equal to that of the ULBP, colour histogram and fused features from 16, 8, 4 and 2 selected partitions. Conversely, with SVM, SIFT consumes a classification time greater than or equal to that of the ULBP, colour histogram and fused features from 16, 8, 4 and 2 selected partitions. In accordance with [16], ELM is almost always found to be much faster at classification than SVM.

A comparison of the time consumed collectively for feature extraction and classification is also given in the last column of Table 9. The combined time of feature extraction and classification using SIFT, with SVM as well as with ELM, is higher than the combined time for ULBP and colour histogram (with SVM and ELM respectively) from 16, 8, 4 and 2 image partitions. However, HOG is considerably less expensive than the proposed feature extraction techniques in terms of the total time for feature extraction and classification.
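
One way such a timing breakdown could be collected is sketched below. It assumes that extraction and classification are timed separately with a wall-clock timer and then summed; `extract_features` and `classify` are hypothetical stand-ins and do not reproduce the actual feature extractors or classifiers.

```python
import time

def timed(func, *args):
    """Run func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

def extract_features(images):
    # Hypothetical stand-in for ULBP / colour histogram / HOG / SIFT extraction.
    return [[sum(img), len(img)] for img in images]

def classify(features):
    # Hypothetical stand-in for prediction with a trained SVM or ELM.
    return [0 if f[0] < 10 else 1 for f in features]

images = [[1, 2, 3], [4, 5, 6]]  # toy data, not real images
features, t_extract = timed(extract_features, images)
labels, t_classify = timed(classify, features)
t_total = t_extract + t_classify  # total time = extraction + classification
```

For the proposed framework, the partition selection step would have to run inside the timed extraction function for the measurement to include it.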

Therefore, the envisioned objectives are fulfilled, and the results can be summarized as follows:

  • Partition selection with the proposed framework consistently outperforms random partition selection, indicating that significant image partitions are successfully identified using sparse autoencoders.

  • Due to reduced dimensions of individual features, fusion based features are successfully created and they consistently outperform individual features for content based image classification.

  • ELM with RBF kernel achieves better classification accuracy than SVM with the same kernel.

  • Dimension reduction results in smaller drop in the classification accuracy with ELM than with SVM.

  • In most of the test cases, the proposed approach has outperformed the state-of-the-art techniques (HOG and SIFT) in terms of classification accuracy.

5 Conclusion

Image data has an unquestionable influence in almost every real-world application today. A plethora of research work has been conducted to design efficient techniques that identify the desired image information with minimum computational overhead. However, most of these techniques yield large feature vectors to represent an image, which in turn makes the identification process time consuming. Moreover, the entire image is considered for feature extraction, which may not be essential for effective feature extraction. This paper identifies these limitations of the existing systems and presents a novel methodology to locate the region of interest (ROI) in an image for extraction of feature vectors. Identifying the ROI drastically reduces the feature size with minimal impact on accuracy for content based image classification. The authors have innovatively used a sparse autoencoder to identify the significant image regions by comparing the reconstruction errors of different image partitions, which leads to feature vector dimension reduction. Experiments are conducted for different image dimensions and with two popular classifiers, namely SVM and ELM. Early fusion of feature vectors extracted using two different techniques improves the classification results with reduced feature dimensionality. The proposed approach has also outperformed the classification performance obtained with features extracted by state-of-the-art techniques such as HOG and SIFT. The framework presented in this work is useful for real time applications that call for resourceful management of image data and its content based classification.