1 Introduction

Object classification has become a challenging task in the area of image processing and computer vision (CV) over the last few decades [6]. The CV domain is widely used in the current era of automation and visual surveillance for the detection and classification of different objects in diverse environments [3, 19]. These environments include pedestrian tracking, disease detection and classification, action recognition, gait recognition, video tracking, person re-identification, optical character recognition, and automatic object detection in autonomous vehicles [9, 44, 50, 54, 56]. Enormous work has been done on object classification [58], but several challenges still exist, such as different color schemes, orientation, complex backgrounds, and the choice of feature extraction [22].

Multiple techniques have been adopted to overcome these challenges and improve classification performance. Researchers have introduced various methods for classifying different objects against complex backgrounds, mostly focusing on reliable feature extraction techniques [52]. Different features have been used in traditional approaches, such as Histogram of Oriented Gradients (HOG) [10], Local Binary Patterns (LBP) [52], and Bag of Words (BoW) [12]. These features are classified through different kinds of machine learning (ML) algorithms, such as Linear Support Vector Machine (L-SVM) [5], Quadratic SVM [29], Cubic SVM, Naïve Bayes [36], Bayesian models, and a few more.

Recently, a new component known as deep learning (DL) was introduced in machine learning, and it has gained significant performance in many applications such as object classification, video streaming, and many more [28, 25, 18]. In DL, the Convolutional Neural Network (CNN) is a well-known approach for feature extraction. Systems based on CNNs show better performance compared to classical feature extraction techniques. Several pre-trained CNN models have been introduced that are freely accessible, including AlexNet [21], VGG16 [37], ResNet50 [15], and InceptionV3 [46]. These models are trained on millions of images of different objects. A simple CNN model includes a few successive layers (i.e., convolutional, pooling, and fully connected) that are used to train the model.

More recently, the fusion of multiple descriptors into one matrix has attracted much interest from CV researchers for several complex challenges such as gait recognition, object classification, action recognition, medical imaging, and many more [35, 38, 20]. The fusion process increases the information about an object under different aspects such as shape, color, and local points. This process affects system accuracy: recent researchers report an average accuracy gain of almost 5%. However, the process has drawbacks in system time complexity, as the overall computational time roughly doubles after fusion. To handle this time complexity, feature selection techniques have been presented by CV researchers. Machine learning deals with diverse datasets having different data types, dimensionality, noise, redundancy, and irrelevant features, and the challenges grow as the amount of data and the number of features increase. The major aim of feature selection is to reduce the noise and redundancy of features so that operations such as classification and detection can be performed robustly, with less computational time and high accuracy [1]. A few well-known feature selection methods are the Genetic Algorithm [26], firefly [49], entropy-controlled selection [2], and the Crow Search Algorithm [39].

The proposed method is inspired by Rashid et al. [33], who presented a fusion technique along with best feature selection. First, they extracted handcrafted features using an improved saliency method and applied entropy for robust feature selection. Later, deep features were extracted using pre-trained deep CNN models and fused with the first type of extracted features. An entropy-controlled selection technique was employed to select the best-fitted features for the final classification. Following this work, we propose a new automated system that skips the segmentation step without losing any accuracy. Moreover, without the segmentation step, our method shows better efficiency in terms of computational time.

2 Related work

Object classification has gained a reputable status in the field of computer vision through well-known applications such as intelligent machine inspection [7, 27, 34]. Many techniques are presented in the literature to tackle the problems of object classification under different environmental conditions such as illumination, scale variance, multifarious backgrounds, and many more. Godrati et al. [13] presented a deep learning model for 3D object classification: bags of features are extracted by employing a spatial pyramid technique, and a pre-trained CNN model is then utilized to extract deep features, which are combined and classified using a Support Vector Machine (SVM). Weibel et al. [53] fused point features along with deep CNN features to provide better performance on rotated and noisy images; the presented method discriminates objects in 3D indoor scenarios and achieved better accuracy on the Stanford dataset. Kaur et al. [16] implemented a computerized system for real-time object classification using convolutional feature maps along with an adaptive learning rate; the model is trained on blurred and noisy data with cluttered backgrounds, and offline training on the Caltech256 dataset achieved exceptional performance. Gill et al. [31] presented an object classification method for indoor and outdoor environments in which SIFT, SURF, and Tamura features are combined and classified through SVM; this method was tested on the MIT-Indoor dataset and achieved outstanding performance. Liu et al. [23] presented a deep learning-based system for object classification in which a pre-trained model is used and activations of the middle layers are extracted as features; the middle-layer features are fused with latent features and classified, and this fusion process gives exceptional performance on the tested dataset.

Feature reduction and selection methods are introduced to select the most relevant features for classification. Many selection methods have been introduced in the literature, each working for specific problems [32, 41]. The selection of the most important features from the original vector is a key challenge in the area of machine learning. A few well-known feature selection methods are entropy-controlled selection, GA, and Euclidean distance (ED) based methods, to name a few. Correlation- and consistency-based feature selection and reduction techniques perform better with supervised detection models [47]. In [24], a hybrid selection method was developed that overcomes the limitations of the Grey Wolf Optimizer (GWO) and the Whale Optimization Algorithm (WOA) by reducing the immature convergence rate and improving binary feature selection for objects. In the literature, different sorts of manual features have been used, such as HOG, LBP, SURF, SIFT, and a few others. From the above studies, it can be summarized that the fusion of these features sometimes provides better accuracy but does not guarantee it; therefore, a selection process is performed to pick the best subset of features and maintain system efficiency and accuracy. The selection process also reduces the number of predictors, which directly affects the classification time (Appendix Table 8).

2.1 Real-time object classification

Real-time object classification includes vehicle classification, number plate detection, and object tracking. Deep learning models are used to perform real-time object detection and classification because of their robustness and lower computational time compared to classical machine learning. In [51], the implemented system is based on Faster R-CNN and performs robustly in less time with a higher accuracy rate. Bilal et al. [4] introduced a model to increase the speed of a kernel classifier by applying a soft cascade; the detection and classification performance for pedestrians in videos is increased by including corresponding features and rejecting irrelevant ones. Zhi et al. [57] presented a modified CNN called LightNet to solve 3D object detection in real-time environments: LightNet predicts class and orientation labels of different 3D objects and shapes, and its performance on the ShapeNet Core55 dataset is robust thanks to efficient training and validation techniques.

3 Problem statement and contributions

Various challenges exist for object classification in static images, which vitiate system accuracy. These challenges include transparent and complex backgrounds, lighting conditions, and similarity among two or more objects. Several methods are presented in the literature, but they still fail to handle these challenges fully. Another major challenge of object classification is the size of the database used for training the models; in this work, we utilized the Caltech 101 dataset, which includes 100 object classes. Moreover, the performance of an automated system always depends on the number of selected features used for classification. In this work, a new automated method is proposed for object classification using deep learning. The major contributions of this work are listed below:

  • Data augmentation is performed by horizontal flip, vertical flip, and transpose operations

  • PHOG features are computed for the shape information of objects

  • CNN-based features are extracted using transfer learning and fused along with PHOG features

  • The best features are selected by a new method named JE-KNN

  • The selected features are validated on various classifiers, and the best results are compared with recent techniques.

4 Proposed methodology

A new method is proposed for object classification using deep learning and classical feature selection. A three-step process is performed: augmentation; CNN and classical feature extraction and fusion; and selection of the best features. For the CNN features, a pre-trained model is used along with transfer learning. The complete architecture of the proposed method is shown in Fig. 1: the augmented database is utilized to extract CNN and classical features in parallel, the best of them are selected before the fusion stage, and at the end a classifier returns the labeled data as output.

Fig. 1. Proposed architecture for object classification using deep learning and classical feature fusion

4.1 Database augmentation

In machine learning, and especially in deep learning, data augmentation is a dominant data extension method that increases the amount of training data; more training data improves the performance of deep learning methods. In the image domain, augmentation includes flipping images, translating image pixels, and a few more operations [48]. Previously, manual processes were used for data augmentation, which need to be automated [8].

In this work, we utilized an automated technique for augmentation of the selected dataset. We utilized the Caltech101 dataset [14], which includes 100 object classes, each with a varying number of images: a few object classes contain fewer than 100 images, whereas others carry up to 800. This imbalance makes the CNN training process more complex; therefore, we perform image augmentation based on the largest object class. In this dataset, the largest class has 800 images; following this class, we equalize the other classes to the same number of images using flipping operations. Mathematically, the performed flip operations are defined as follows.

Let an input image matrix of dimension 256 × 256 be denoted by \( {\overset{\sim }{M}}_{i,j} \), as shown in Fig. 2, with ith rows and jth columns, where \( {\overset{\sim }{M}}_{i,j}\in {\mathcal{R}}^{i\times j} \). The rows are \( \mathrm{i}=\left\{1,2,\dots \overset{\sim }{\mathrm{m}}\right\} \) and the columns \( \mathrm{j}=\left\{1,2,\dots \overset{\sim }{\mathrm{n}}\right\} \), and the number of channels is 3. The input image is RGB, and it is utilized for three different flip operations for augmentation:

$$ {\overset{\sim }{M}}^T={\overset{\sim }{M}}_{j,i}. $$
(1)
Fig. 2. An example of flip operations on image \( {\overset{\sim }{M}}_{i,j} \)

where \( {\overset{\sim }{M}}^T \) denotes the transpose of the original image; after this operation, the row and column indices of the original image are swapped.

$$ {\overset{\sim }{M}}^H={\overset{\sim }{M}}_{i,\left(\overset{\sim }{n}+1-j\right)}. $$
(2)

where \( {\overset{\sim }{M}}^H \) denotes the horizontally flipped image.

$$ {\overset{\sim }{M}}^V={\overset{\sim }{M}}_{\left(\overset{\sim }{m}+1-i\right),j}. $$
(3)

where \( {\overset{\sim }{M}}^V \) denotes the vertically flipped image. These three operations are performed until the number of images in each object class is equal. An example of a flipped image can be seen in Fig. 2: the image appearance changes after the flip operation, yet only the positions of the ith and jth pixels are changed.
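As a hedged illustration, the three flip operations of Eqs. (1)-(3) and the class-balancing loop can be sketched in a few lines (a minimal NumPy sketch; the function names and the cycling strategy are our own assumptions, not the authors' code):

```python
import numpy as np

def transpose_flip(img):
    # Eq. (1): swap row and column indices in each channel
    return np.transpose(img, (1, 0, 2))

def horizontal_flip(img):
    # Eq. (2): column j -> n + 1 - j
    return img[:, ::-1, :]

def vertical_flip(img):
    # Eq. (3): row i -> m + 1 - i
    return img[::-1, :, :]

def equalize_class(images, target=800):
    # Cycle the three flips over a class's images until the class reaches
    # the size of the largest class (800 images in Caltech101).
    flips = (horizontal_flip, vertical_flip, transpose_flip)
    out, k = list(images), 0
    while len(out) < target:
        out.append(flips[k % 3](images[k % len(images)]))
        k += 1
    return out
```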

4.2 Features extraction

Feature extraction is a key step in pattern recognition for representing an object in an image. The performance of any automated method depends on the number of extracted features: strong, relevant features give better accuracy, but redundant or noisy features vitiate system performance. In this work, we extract two different types of features: classical (well-known) features and CNN-based features. For the classical features, we compute Pyramid HOG (PHOG) [55] and central-symmetric LBP (CSLBP) [45], whereas for CNN features, the pre-trained model named Inception V3 is utilized [46]. A detailed description of each feature type is given below.

4.2.1 Classical features

Pyramid HOG features

We have an input image \( {\overset{\sim }{A}}_{i,j} \) after the data augmentation step, where the dimension of \( {\overset{\sim }{A}}_{i,j} \) is \( \overset{\sim }{m}\times \overset{\sim }{n} \) with ith rows and jth columns, and the pixel values range from 0 to 255. To compute the pyramids of an input image, the original image is converted to grayscale and three steps are defined. In the first step, the original image is copied, as shown in Fig. 3. In the second step, the image is divided into a 2 × 2 layout, and in the third step, each cell of step 2 is further divided into 2 × 2, as shown in Fig. 3. This process gives 21 layouts in total, from which HOG features are extracted. The HOG features are computed in five steps.

Fig. 3. Representation of PHOG features for object classification

First of all, gamma correction is performed to improve the contrast of the image against illumination and viewpoint changes. The gamma correction is defined by the following expression:

$$ {\overset{\sim }{A}}_{i,j}=\sqrt{{\overset{\sim }{A^{\prime}}}_{i,j}}. $$
(4)

Later, horizontal and vertical gradients are computed to further compensate for weak illumination, using the following expressions:

$$ {\Delta}_x\left(i,j\right)=\overset{\sim }{A}\left(i+1,j\right)-\overset{\sim }{A}\left(i-1,j\right). $$
(5)
$$ {\Delta}_y\left(i,j\right)=\overset{\sim }{A}\left(i,j+1\right)-\overset{\sim }{A}\left(i,j-1\right). $$
(6)

Using these gradients, the magnitude and orientation are computed as:

$$ \Delta \left(i,j\right)=\sqrt{\Delta_x^2\left(i,j\right)+{\Delta}_y^2\left(i,j\right)}. $$
(7)
$$ \uptheta \left(\mathrm{i},\mathrm{j}\right)={\tan}^{-1}\left(\frac{\Delta_y\left(i,j\right)}{\Delta_x\left(i,j\right)}\right). $$
(8)

By employing the gradient magnitude information, the supreme (maximum) gradient value across channels is selected. The selection of the supreme gradient is defined through the following expression:

$$ \overbrace{\Delta \left(i,j\right)}=\underset{c\in \left\{\overset{\sim }{A}\right\}}{\max}\left\{{\Delta}^c\left(i,j\right)\right\}. $$
(9)

where \( {\Delta}^c \) represents the gradient magnitude of channel c. Later, cell quantization is performed over neighborhoods of 8 × 8 pixels. These cells are combined in the next step, which returns a feature vector. The resultant vector is normalized in the last step by the L2-norm:

$$ \mathrm{L}2-\mathrm{Norm}:f=\frac{V}{\sqrt{{\left|\left|V\right|\right|}_2^2+{e}^2}} $$
(10)

The resultant normalized PHOG vector is denoted by ΔPHOG(N, f), where N denotes the number of all test images and f denotes the extracted PHOG features.
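A minimal NumPy sketch of the PHOG pipeline described above (Eqs. (4)-(10)) is given below; the bin count and the pyramid splitting details are illustrative assumptions rather than the authors' exact configuration:

```python
import numpy as np

def gradients(gray):
    # Eqs. (4)-(8): gamma correction, central-difference gradients,
    # then magnitude and unsigned orientation
    g = np.sqrt(gray.astype(np.float64))          # Eq. (4)
    gx = np.zeros_like(g); gy = np.zeros_like(g)
    gx[1:-1, :] = g[2:, :] - g[:-2, :]            # Eq. (5), along i
    gy[:, 1:-1] = g[:, 2:] - g[:, :-2]            # Eq. (6), along j
    mag = np.sqrt(gx ** 2 + gy ** 2)              # Eq. (7)
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0  # Eq. (8)
    return mag, ang

def phog(gray, levels=3, bins=9, eps=1e-8):
    # Pyramid layouts: level l splits the image into 2^l x 2^l cells,
    # so levels 0..2 give 1 + 4 + 16 = 21 layouts, as in the text.
    mag, ang = gradients(gray)
    feats = []
    for l in range(levels):
        k = 2 ** l
        for rows in np.array_split(np.arange(gray.shape[0]), k):
            for cols in np.array_split(np.arange(gray.shape[1]), k):
                m = mag[np.ix_(rows, cols)]
                a = ang[np.ix_(rows, cols)]
                # magnitude-weighted orientation histogram of this layout
                h, _ = np.histogram(a, bins=bins, range=(0, 180), weights=m)
                feats.append(h)
    v = np.concatenate(feats)
    return v / np.sqrt(np.sum(v ** 2) + eps ** 2)  # Eq. (10), L2-norm
```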

Central symmetric LBP

Second, central-symmetric LBP features are extracted from gray images to handle illumination changes and to reduce the complexity of the original LBP features. In original LBP, the central pixel is compared with all neighboring pixels, whereas CSLBP compares only diametrically opposed (equally spaced) pixel pairs. Mathematically, the original LBP features are computed as follows:

$$ {LBP}_{r,n}\left(i,j\right)=\sum \limits_{k=0}^{n-1}s\left({x}_k-{x}_c\right){2}^k. $$
(11)
$$ s(i)=\left\{\begin{array}{c}1,\kern3.75em i\ge 0\\ {}0\kern1.25em Otherwise\end{array}\right.. $$
(12)

Whereas, the CSLBP features are computed as:

$$ {CSLBP}_{r,n}\left(i,j\right)=\sum \limits_{k=0}^{\left(n/2\right)-1}s\left({x}_k-{x}_{k+\left(n/2\right)}\right){2}^k. $$
(13)
$$ s(i)=\left\{\begin{array}{c}1,\kern3.75em i>T\\ {}0\kern1.25em Otherwise\end{array}\right.. $$
(14)

Compared to LBP, CSLBP features consume less computational time: LBP produces \( 2^8=256 \) binary patterns per window, whereas CSLBP produces only \( 2^4=16 \). The notation T in Eq. (14) is a small threshold on the pixel differences used to generate the binary patterns. Finally, the produced CSLBP and PHOG features are serially combined into one matrix as follows:

$$ {F}_{Cls}\left(N,f\right)={\left(\genfrac{}{}{0pt}{}{\Delta_{PHOG}\left(N,f\right)}{CSLBP_{r,n}\left(i,j\right)}\right)}_{N\times \overset{\sim }{\boldsymbol{f}}}. $$
(15)

where \( \overset{\sim }{\boldsymbol{f}} \) represents the length of the combined classical feature vector for each image and FCls(N, f) denotes the fused vector.
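For illustration, the CSLBP operator of Eqs. (13)-(14) can be sketched as follows (a NumPy sketch assuming n = 8 neighbors at radius 1 and a 16-bin histogram per window; the threshold value T is an assumption):

```python
import numpy as np

def cslbp_histogram(gray, T=3.0):
    # Eqs. (13)-(14) with n = 8 neighbors at radius 1: each pixel is
    # encoded by comparing its 4 diametrically opposed neighbor pairs.
    g = gray.astype(np.float64)
    H, W = g.shape
    pairs = [((-1, -1), (1, 1)), ((-1, 0), (1, 0)),
             ((-1, 1), (1, -1)), ((0, 1), (0, -1))]
    code = np.zeros((H - 2, W - 2), dtype=np.uint8)
    for k, ((dy1, dx1), (dy2, dx2)) in enumerate(pairs):
        p1 = g[1 + dy1:H - 1 + dy1, 1 + dx1:W - 1 + dx1]
        p2 = g[1 + dy2:H - 1 + dy2, 1 + dx2:W - 1 + dx2]
        code |= ((p1 - p2) > T).astype(np.uint8) << k   # s(.) of Eq. (14)
    hist = np.bincount(code.ravel(), minlength=16)      # 2^4 = 16 patterns
    return hist / max(hist.sum(), 1)
```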

4.2.2 CNN features

In the CNN-based feature extraction step, we utilized a pre-trained CNN model named Inception V3 [46]. Inception V3 has a total of 316 layers and 350 connections, of which 94 are convolutional layers. In this model, several filters are applied at the same layer to extract deep features. Traditional CNN layers allow the network to use only one filter size per layer; the flexibility of Inception allows filters of different sizes, with different numbers of parameters, to be applied at the same layer. In this model, 1 × 1 convolutional filters are used to extract features. A simple Inception V3 model is shown in Fig. 4.

Fig. 4. Architecture of Inception V3 [46]

The Inception V3 model is initially trained on the ImageNet database [11]; therefore, we copy the complete structure of Inception V3 by employing the transfer learning concept and perform new training on the augmented Caltech101 dataset. For this purpose, we divide the augmented dataset 50:50 for training and testing. Later, Inception V3 is trained on the Caltech101 dataset using transfer learning. The cross-entropy loss function is utilized, features are extracted at the avg_pool layer, and a resultant vector of dimension N × 2048 is obtained. These features are then passed to the JE-KNN selection method, and the best-selected features are fused along with the handcrafted features, as shown in Fig. 5. The detail of this figure is given in Section 4.3 below.
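A hedged analogue of this feature extraction step in Keras is sketched below (the paper performs it in MATLAB; the preprocessing and the 299 × 299 input size here follow the standard Keras Inception V3, not necessarily the authors' setup):

```python
import numpy as np
import tensorflow as tf

# Load Inception V3 pre-trained on ImageNet; pooling="avg" exposes the
# 2048-D average-pool output used for feature extraction in the text.
base = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, pooling="avg")

def deep_features(images):
    # images: float array of shape (N, 299, 299, 3) with values in [0, 255]
    x = tf.keras.applications.inception_v3.preprocess_input(
        np.asarray(images, dtype=np.float32))
    return base.predict(x, verbose=0)   # shape (N, 2048)
```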

Fig. 5. Proposed classical and CNN feature fusion and reduction model for object classification

4.3 Feature selection

In the area of machine learning, feature selection is the process of obtaining the smallest number of strong features from the original set with minimum information loss. Researchers have proposed many algorithms that reduce a huge amount of data into a few informative chunks. A high-dimensional feature set significantly increases an algorithm's memory footprint, computational cost, and storage requirements. For this purpose, an effective search algorithm is required that not only removes irrelevant features but also handles the problem of redundant information. In this work, we present a new selection method for reducing irrelevant and redundant information. The complete feature extraction and selection process is shown in Fig. 5, where the notations F1, F2, and F3 denote the extracted feature matrices of the PHOG descriptor, CSLBP, and the Inception V3 deep CNN, respectively, and N denotes the total number of images utilized for training and testing. The PHOG and CSLBP features are serially combined and passed to the JE-KNN based selection method; at the same time, the deep features extracted from Inception V3 are passed to JE-KNN. The features selected by this method are fused and then classified. The detail of each step is given below.

We have two extracted feature vectors: the classical vector denoted by FCls(N, f) and the CNN vector denoted by FCnn(N, f), where N represents the number of samples and f the number of extracted features, f = (1, 2, …, n). We then implement a new feature selection method named Joint Entropy along with KNN (JE-KNN). A three-step process is followed in this method. In the first step, the required weights are initialized, where the original input features are set as weights. In the second step, Joint Entropy (JE) is computed on the original vector, producing a new vector that is sorted into relevant and irrelevant features by a threshold function. In the last step, the thresholded JE vector is provided to a KNN classifier for loss calculation. This process continues until the required error rate is met. The mathematical formulation of JE-KNN is expressed as follows:

Suppose the extracted feature vector is denoted by \( \tilde{f} \) and the column length of each feature vector by \( \tilde{c} \), where the extracted vectors are FCls(N, f) and FCnn(N, f), respectively. The pair \( (\tilde{f},\tilde{c}) \) takes values \( ({\tilde{f}}_i,{\tilde{c}}_i) \) with joint probability distribution \( p\left({\tilde{f}}_i,{\tilde{c}}_i\right) \). Hence, the joint entropy \( H\left(\tilde{f},\tilde{c}\right) \) is formulated as:

$$ \boldsymbol{H}\left(\tilde{\boldsymbol{f}},\tilde{\boldsymbol{c}}\right)=\underset{{\tilde{f}}_i,{\tilde{c}}_i}{\Sigma}p\left({\tilde{f}}_i,{\tilde{c}}_i\right)\log \frac{1}{p\left({\tilde{f}}_i,{\tilde{c}}_i\right)}. $$
(16)
$$ =\underset{{\tilde{f}}_i,{\tilde{c}}_i}{\Sigma}p\left({\tilde{f}}_i\right)p\left({\tilde{c}}_i|{\tilde{f}}_i\right)\log \frac{1}{p\left({\tilde{f}}_i\right)}+\underset{{\tilde{f}}_i,{\tilde{c}}_i}{\Sigma}p\left({\tilde{f}}_i\right)p\left({\tilde{c}}_i|{\tilde{f}}_i\right)\log \frac{1}{p\left({\tilde{c}}_i|{\tilde{f}}_i\right)}. $$
(17)
$$ =\underset{{\tilde{f}}_i}{\Sigma}p\left({\tilde{f}}_i\right)\log \frac{1}{p\left({\tilde{f}}_i\right)}\underset{{\tilde{c}}_i}{\Sigma}p\left({\tilde{c}}_i|{\tilde{f}}_i\right)+\underset{{\tilde{f}}_i,{\tilde{c}}_i}{\Sigma}p\left({\tilde{f}}_i\right)p\left({\tilde{c}}_i|{\tilde{f}}_i\right)\mathit{\log}\frac{1}{p\left({\tilde{c}}_i|{\tilde{f}}_i\right)}. $$
(18)
$$ =H\left(\tilde{f}\right)+\underset{{\tilde{f}}_i}{\Sigma}p\left({\tilde{f}}_i\right)H\left(\tilde{c}|\tilde{f}={\tilde{f}}_i\right). $$
(19)
$$ =H\left(\tilde{f}\right)+{\mathbb{E}}_{{\tilde{f}}_i}\left[H\left(\tilde{c}|\tilde{f}={\tilde{f}}_i\right)\right]. $$
(20)
$$ \boldsymbol{H}\left(\tilde{\boldsymbol{f}},\tilde{\boldsymbol{c}}\right)=H\left(\tilde{f}\right)+H\left(\tilde{c}|\tilde{f}\right). $$
(21)

A threshold function is defined on \( \boldsymbol{H}\left(\tilde{\boldsymbol{f}},\tilde{\boldsymbol{c}}\right) \) based on the average value of the resultant JE matrix. Through this function, the features whose JE value is equal to or higher than the average are selected. This process is continued for 50 iterations, and at each iteration the performance is computed using a KNN classifier; after 50 iterations, the features yielding the best accuracy are selected as the final subset. The MATLAB function fitcknn [43] is utilized for this purpose along with 10-fold cross-validation. In KNN, the Euclidean distance is employed, which returns the accuracy and error rate. Based on the error rate, we decide the best-selected vector:

$$ error=\frac{1}{n}\underset{i=1}{\overset{n}{\Sigma}}\mathbb{I}\left\{{\tilde{a}}_i\ne {a}_i\right\}. $$
(22)

where \( {\tilde{a}}_i \) denotes the predicted label and \( {a}_i \) the true label of sample i. Finally, the selected vector with the best accuracy and minimum error rate is provided to multiple classifiers such as linear discriminant, SVM, ensemble tree, and cosine KNN [40]. Based on the highest accuracy, the best classifier is selected. The experimental results are presented in detail in the section below.
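A minimal sketch of one plausible reading of JE-KNN is given below (Python with scikit-learn; the per-iteration threshold perturbation and the feature binning are our assumptions, since the paper does not fully specify them):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def joint_entropy_scores(F, y, n_bins=16):
    # Per-feature joint entropy H(f~, c) of Eq. (16), with each feature
    # column quantized into n_bins levels; y holds integer class labels.
    scores = np.zeros(F.shape[1])
    classes = np.unique(y)
    for k in range(F.shape[1]):
        fq = np.digitize(F[:, k], np.histogram_bin_edges(F[:, k], n_bins))
        joint, _, _ = np.histogram2d(fq, y, bins=(n_bins + 2, len(classes)))
        p = joint / joint.sum()
        p = p[p > 0]
        scores[k] = -(p * np.log2(p)).sum()
    return scores

def je_knn_select(F, y, iters=50, k=5, seed=0):
    # Keep features scoring at or above a mean-based threshold, evaluate
    # the subset with 10-fold cross-validated KNN (Euclidean distance),
    # and retain the best subset found over `iters` iterations.
    scores = joint_entropy_scores(F, y)
    rng = np.random.default_rng(seed)
    best_mask, best_acc = None, -1.0
    for _ in range(iters):
        thr = scores.mean() * rng.uniform(0.9, 1.1)  # perturbed threshold
        mask = scores >= thr
        if not mask.any():
            continue
        acc = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                              F[:, mask], y, cv=10).mean()
        if acc > best_acc:
            best_acc, best_mask = acc, mask
    return best_mask, best_acc
```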

5 Experimental setup and results

The proposed method is validated on a publicly available dataset named Caltech-101 [14]. This dataset consists of a total of 9144 RGB and grayscale images of 101 unique object classes, a few of which are shown in Fig. 6. Each class contains a different number of images, ranging from 31 to 800. The mix of RGB and grayscale images makes object classification challenging. We utilized an Intel Core i7 8th-generation CPU equipped with 16 GB of RAM and an 8 GB GPU. All simulations were performed in MATLAB 2018a.

Fig. 6. Sample images from Caltech-101 dataset

In the experimental process, a 50:50 training/testing split is utilized along with 10-fold cross-validation. We utilized multiple classifiers and select the best one based on the highest performance rate. The performance is calculated through measures such as accuracy, computational time, and false negative rate (FNR).
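For clarity, accuracy and FNR can be computed from a confusion matrix as in the following sketch (a NumPy illustration with macro-averaged FNR; the paper does not state its exact averaging):

```python
import numpy as np

def accuracy_and_fnr(cm):
    # cm: square confusion matrix, cm[i, j] = number of class-i samples
    # predicted as class j
    cm = np.asarray(cm, dtype=np.float64)
    acc = np.trace(cm) / cm.sum()
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp
    fnr = np.mean(fn / np.maximum(tp + fn, 1))  # macro-averaged FNR
    return acc, fnr
```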

5.1 Results

As mentioned above, the Caltech101 dataset is utilized in this work for the experimental process; therefore, we split this dataset into four groups: in the first group we select the first 25 object classes and perform classification, then 50, 75, and all classes. A brief description of this process is given in Table 1, which shows that three experiments are performed to analyze the performance of the proposed system. The main reason behind these experiments is to check the efficiency, scalability, and change in accuracy of the proposed system after the feature fusion and selection processes.

Table 1 Number of performed experiments for classification results on Caltech101 dataset

5.1.1 Experiment 1

In this experiment, the classical features, PHOG and central-symmetric LBP (CSLBP), are fused and the proposed selection method is applied. The results are evaluated on different numbers of classes, as presented in Table 2, using multiple classifiers such as LDA and ESD. The best classification accuracy for the first 25 classes is 47.3%, with an error rate of 52.7%, on the ESD classifier, whereas on the other classifiers, LDA, L-SVM, and Co-KNN, the accuracy is 41.4%, 41.2%, and 43.3%, respectively. On the top 50 object classes, the best accuracy is 38% on ESD, whereas the minimum is 28.4% on LDA. With a further increase in object classes, the best accuracy degrades to 33.7% on L-SVM. On all 100 classes, the best noted accuracy is 30.2% on Co-KNN, whereas the worst is 22.6% on LDA. From these results, it is noted that the accuracy with classical features degrades as the number of object classes increases. In addition, the testing classification time of each classifier against the selected number of classes is shown in Fig. 7: fewer object classes (25 classes) execute with minimum time, whereas on all object classes the computation time is the highest. Moreover, it is also noted that, using classical features, the accuracy of the system decreases and the time increases as more classes are added, from 25 to 50 to 75 to all (100).

Table 2 Fusion and selection of only classical features using proposed method
Fig. 7. Classification computation time of each classifier using classical features on different number of object classes

5.1.2 Experiment 2

In this experiment, the CNN features are fused with the classical features (PHOG and CSLBP). The selection process is not performed on the fused vector in this experiment, in order to analyze the effectiveness of the proposed selection process. The results are evaluated on different numbers of classes, as presented in Table 3, using multiple classifiers such as LDA and ESD. The best classification accuracy for the first 25 classes is 92.5%, with an error rate of 7.5%, on the LDA classifier, whereas on the other classifiers, L-SVM, ESD, and Co-KNN, the accuracy is 88.5%, 92.2%, and 89.9%, respectively. On the top 50 object classes, the best accuracy is 91.5% on LDA, whereas L-SVM, ESD, and Co-KNN obtain 88.3%, 91.4%, and 87.5%, respectively. It is noted that the fusion process maintains the accuracy after the addition of more classes, in contrast to classical features alone. On all 100 classes, the best noted accuracy is 87.3% on the ESD classifier, whereas the worst is 83.7% on Co-KNN. Overall, the results are improved after the fusion process, but the classification time is almost doubled, as plotted in Fig. 8: after fusion, the time is almost double that of classical features. Moreover, we also note that adding more object classes slightly decreases the classification accuracy while the classification time grows.

Table 3 Fusion of CNN and classical features
Fig. 8. Classification computation time of each classifier after fusion of CNN and classical features on different number of object classes

5.1.3 Proposed feature selection

In this experiment, the proposed feature selection method is employed on the fused feature vector (CNN and classical features). The best features are selected through Joint Entropy along with the KNN fitness function. The selected features are classified through multiple classifiers, and the numerical results are presented in Table 4 for different numbers of selected classes against each classifier. The results with the proposed selection process are improved compared to Tables 2 and 3. In this experiment, the best achieved accuracy for 25 classes is 93.9%, compared to 92.5% previously (Table 3). This best accuracy is obtained on the LDA classifier with an error rate of 6.1%, as also shown in the confusion matrix in Fig. 9 (25 classes). Second, classification is performed on 50 object classes, achieving a best accuracy of 92.6%, compared to 91.5% previously (Table 3); this accuracy is obtained on LDA and also verified through Fig. 9 (50 classes). After that, the classes are increased to 75 and the results diminish slightly: the achieved accuracy on 75 object classes is 90.4% with an error rate of 9.6%, as validated through the confusion matrix in Fig. 9 (75 classes). This accuracy is higher than the previously achieved performance on the fused vector, 87.0% (Experiment 2). Finally, all object classes are considered for classification, achieving an accuracy of 90.1%, which is the best compared to both previous experiments, as verified through Fig. 9 (100 classes). The classification time for all classifiers is shown in Fig. 10, which illustrates that the proposed method performs efficiently on the Caltech101 dataset. Moreover, the overall accuracy is also improved after employing the proposed selection method.

Table 4 Classification results of Caltech-101 using proposed method
Fig. 9. Confusion matrices of the proposed method results

Fig. 10. Classification time after employing proposed method

5.2 Discussion

A brief discussion of the proposed results, the scalability of the proposed method as more object classes are added, the effect of the number of features and object classes on classification time, and a comparison with existing techniques based on accuracy are given in this section. As presented in Tables 2, 3 and 4, three different experiments are performed based on Table 1. In the first experiment, only classical features are fused, achieving a maximum accuracy of 30.2% on the complete dataset. In the second experiment, CNN and classical features are fused without the selection method, attaining an accuracy of 87.3%, which is significantly improved after the addition of CNN features; the fusion results show the worth of CNN features for object classification. In the last experiment, the proposed selection method is applied and achieves an accuracy of 90.1% with higher efficiency. The proposed selection method increases the classification accuracy and reduces the computation time during the classification process.

Scalability is an important factor for any proposed algorithm. Our proposed method maintains its accuracy when more object classes are added, as clearly depicted in Tables 2, 3 and 4, although the classification time increases with the addition of more classes. The classification time for each experiment is plotted in Figs. 7, 8, and 10, which show that the proposed selection method requires less execution time compared to the original classical and fused vectors.

A detailed statistical analysis is also conducted and presented in Table 5, where the minimum, maximum, and average values are calculated for each classifier and then σ and the confidence interval (CI) are computed. Based on the CI, the best results are achieved on the LDA classifier, with a maximum accuracy of 90.1%, σ = 0.3681, and CI = 0.2125. The CI of the ESD classifier is also plotted in Fig. 11, which shows that at the 95% confidence level the CI is 89.633 ± 0.417 (±0.46%). The overall CI for the LDA classifier is 0.2125. From these results, it is clear that the proposed results are consistent over several iterations.
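As a worked illustration of the CI computation (assuming a normal approximation over repeated runs; the paper's exact procedure is not stated), the interval at a given confidence level can be computed as:

```python
import numpy as np
from scipy import stats

def confidence_interval(accs, level=0.95):
    # accs: accuracies collected over repeated runs/iterations
    accs = np.asarray(accs, dtype=np.float64)
    mean, sigma = accs.mean(), accs.std(ddof=1)
    z = stats.norm.ppf(0.5 + level / 2.0)       # e.g. 1.96 at 95%
    half = z * sigma / np.sqrt(len(accs))       # CI half-width
    return mean, half                           # interval: mean ± half
```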

Table 5 Statistical analysis of proposed feature selection method using Caltech 101 dataset
Fig. 11. Representation of confidence interval at different confidence levels for LDA classifier

Finally, a fair comparison is conducted in terms of accuracy and classification time with existing techniques, as presented in Table 6. Song et al. [42] presented a PCA-based feature selection technique along with an SVM classifier and achieved an accuracy of 83.9% on the Caltech-101 dataset. Li et al. [17] performed extreme learning and YCbCr color transformation for object classification and achieved 78% accuracy on Caltech-101. Pan et al. [30] employed a K-Means clustering-based technique for feature reduction and achieved a classification accuracy of 85.78%. More recently, Rashid et al. [33] fused CNN and SIFT features and obtained a classification accuracy of 89.7% on Caltech101. In contrast, our method achieves an accuracy of 93.9% on 25 object classes, 92.6% on 50 classes, 90.4% on 75 classes, and 90.1% overall on the complete Caltech101 dataset. Moreover, the proposed method also outperforms the others in terms of computational time (Table 6).

Table 6 Proposed results comparison with existing techniques

5.3 Critical analysis

Based on a critical analysis of each step involved in the proposed method, a substantial change occurs in the classification results. Augmentation is a key step in this regard, as the results in Table 7 clearly show: the classification accuracy changes after augmentation. Initially, we calculated the results on the original Caltech101 dataset and attained an accuracy of 80.40%. After a horizontal flip to increase the data, the accuracy increased by more than 3%. With the further vertical flip and transpose operations, the accuracy reached 90.10%. This clearly shows that increasing the number of images in each class trains a better model, which later gives improved classification accuracy. Further, we tested the proposed method on different training/testing ratios; the results can be seen in Fig. 12, which shows that a higher training ratio improves the accuracy but compromises a fair comparison. Hence we adopt a 50:50 ratio in the proposed work.

Table 7 Change in classification results after data augmentation step
Fig. 12. Analysis of results on different training/testing ratios

6 Conclusion

In conclusion, we propose an automated system for object classification using classical and deep feature selection. Data augmentation is performed to handle the problem of insufficient training data. Then classical features are computed from gray images to capture the local properties of objects. Later, CNN features are computed and combined with the classical features. In the next stage, we benefit from the best-selected features obtained by the JE-KNN based method and achieve high accuracy; the feature selection also helps reduce the overall computational time. Overall, the proposed method accomplishes an accuracy of 90.1% on the Caltech101 dataset. A comparison conducted with recent techniques shows the authenticity of the presented method. However, during the analysis of the results, we observed that the proposed method increases the error rate for a few classifiers: compared to the ESD classifier, the difference between the accuracies of SVM and Co-KNN is almost 18%, which is a large gap and the main limitation of our work. This problem can be addressed through the selection of classifiers such as Softmax, ELM, and Naïve Bayes. In the future, deep reinforcement learning will be employed to achieve better accuracy on this dataset. Moreover, a more efficient feature selection method will be proposed and applied to the same system. Furthermore, the Caltech256 dataset will be used in future studies on object classification.