1 Introduction

The image classification problem is one of the key research objectives in the field of image processing and has a wide range of applications in object recognition, content understanding, image matching and so on. Although substantial research results have accumulated over the years, classification remains a research focus due to the complexity and diversity of image information. The task is worth studying because it can be widely applied in pattern recognition and computer vision. Image classification methods are numerous and include Support Vector Machine (SVM), Nearest Neighbor (NN), Gradient Boosting (GB), Convolutional Neural Network (CNN), etc. These machine learning and data-driven algorithms can classify images efficiently and have demonstrated their reliability and validity. Nevertheless, most existing image classification methods neither sufficiently utilize image information nor establish robust features for recognition, leaving room to improve the recognition rate by providing reliable high-level features.

Among intelligent image classification methods, models based on neural networks form one of the important research directions. Deep neural networks can theoretically approximate any complex function and effectively solve the problems of image feature extraction and classification. However, due to model complexity, training difficulty and high cost, such structures were long hard to apply effectively. Recently, the latest line of research on neural networks, Deep Learning, has achieved continuous breakthroughs in many fields, including image classification [1, 2], object detection [3, 4] and face recognition [5,6,7]. Deep architectures have been successfully applied to large-scale image processing systems and have achieved state-of-the-art performance, showing an optimistic prospect for solving the classification task.

Feature extraction is the most important process in an automatic image classification system. The feature quality directly influences recognition performance, which leads to time-consuming feature engineering in traditional image classification tasks. CNN is an efficient Deep Learning model with a hierarchical structure that learns high-quality features at each layer. Since the model reduces the complexity of the network structure and the number of parameters through local receptive fields, weight sharing and pooling operations, it has been widely used in image classification and has achieved excellent results. It can also take raw images directly as input, which preserves more image information for subsequent feature extraction.

On the other hand, classification is another significant process in an automatic image classification system. CNN has been recognized as a powerful and effective mechanism for feature extraction, but the traditional classifiers connected to a CNN do not fully exploit the extracted features, which suggests a promising direction for new solutions to the image classification problem. eXtreme Gradient Boosting (XGBoost) [9] is an ensemble learning algorithm based on GB, whose principle is to achieve accurate classification results through iterative computation of weak classifiers. XGBoost is widely applied in many domains [7, 8, 10] because of its high efficiency and accuracy.

Motivated by the above facts, this paper explores the combination of the CNN model and the XGBoost algorithm, since both already perform excellently on the image classification problem. A novel image classification method with a CNN-XGBoost model is proposed to improve classification performance. The proposed model provides more precise output by integrating CNN as a trainable feature extractor that automatically obtains features from the input and XGBoost as a recognizer at the top level of the network to produce results. This two-stage model ensures highly reliable feature extraction and classification.

The rest of the paper is organized as follows: Sect. 2 introduces the basic concepts of CNN and XGBoost and then describes the CNN-XGBoost model. Experimental results on the well-known MNIST and CIFAR-10 databases are presented and discussed in Sect. 3. Finally, Sect. 4 concludes the paper.

2 The Novel CNN-XGBoost Model

In this paper, we propose a novel image classification method with a CNN-XGBoost model to improve classification performance. By integrating CNN as a trainable feature extractor that automatically obtains features from the input and XGBoost as a recognizer at the top level of the network to produce results, our method ensures highly reliable feature extraction and classification. In the following, we briefly describe the two components and then introduce the novel CNN-XGBoost model.

2.1 Convolutional Neural Network

The convolutional neural network (CNN) was first proposed by Professor Yann LeCun and his colleagues at AT&T Bell Laboratories and used for the recognition and classification of handwritten digit images [11]. CNN takes advantage of the concepts of receptive fields, weight sharing and sub-sampling (pooling) to reduce the complexity of the network structure and the number of parameters. The "receptive field" is equivalent to constructing a number of spatially localized filters that capture salient features of the input, "weight sharing" reduces the number of parameters to be trained, and "pooling" simplifies the model and prevents over-fitting.

A typical CNN consists of alternating convolution and sub-sampling layers, followed by fully connected layers before the final output layer. All filter kernels (convolution kernels) are usually adjusted by the back-propagation algorithm [12], which is based on stochastic gradient descent, to reduce the gap between the network output and the training labels. Overall, the convolution layer (C layer) obtains local features by connecting to local receptive fields, while the sub-sampling layer (S layer) is a feature mapping layer used for the pooling operation and for the secondary feature extraction. Each C layer is followed by an S layer, and this two-stage feature extraction structure gives the convolutional neural network strong distortion tolerance on input images.

LeNet-5 [11] is a classic convolutional neural network architecture for handwritten digit recognition proposed by Yann LeCun et al. It consists of seven layers (not counting the input): three convolution layers, two sub-sampling layers, one fully connected layer and an output layer. In order to reduce the computational complexity, the specific structure of the CNN for image classification in this paper is a simplified version of LeNet-5, which includes an input layer (Input), two C layers (C1, C2), two S layers (S1, S2), fully connected layers and an output layer (Output). Figure 1 demonstrates this structure, and a code sketch follows the figure. Since one convolution kernel of a convolution layer can only extract one characteristic of the input feature maps, multiple convolution kernels are required to extract different features.

Fig. 1. The specific structure of CNN for image classification
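The following Keras sketch mirrors the simplified LeNet-5 structure of Fig. 1: two convolution layers (C1, C2), two sub-sampling layers (S1, S2), a fully connected layer and a soft-max output. The 5x5 kernels, 2x2 pooling and the 10/21 feature maps for MNIST follow the settings reported later in Sect. 3.2, while the tanh activations, the fully connected width (84) and the SGD optimizer are assumptions for illustration, not the authors' exact configuration.

```python
# Minimal sketch of the simplified LeNet-5 style CNN described above.
# Assumptions: tanh activations, an 84-unit fully connected layer, plain SGD.
from tensorflow.keras import layers, models

def build_cnn(input_shape=(28, 28, 1), n_classes=10,
              c1_maps=10, c2_maps=21, fc_units=84):
    model = models.Sequential([
        layers.Input(shape=input_shape),                    # Input
        layers.Conv2D(c1_maps, (5, 5), activation='tanh'),  # C1
        layers.AveragePooling2D((2, 2)),                    # S1
        layers.Conv2D(c2_maps, (5, 5), activation='tanh'),  # C2
        layers.AveragePooling2D((2, 2)),                    # S2
        layers.Flatten(),
        layers.Dense(fc_units, activation='tanh'),          # fully connected
        layers.Dense(n_classes, activation='softmax'),      # Output
    ])
    model.compile(optimizer='sgd',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```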

2.2 eXtreme Gradient Boosting

XGBoost is a highly effective and scalable machine learning system for tree boosting that has been widely used in many fields to achieve state-of-the-art results on many data challenges. Developed by Tianqi Chen et al., its scalability across scenarios is due to several important systems and algorithmic optimizations, including a novel tree learning algorithm, a theoretically justified weighted quantile sketch procedure, and parallel and distributed computing [13].

Tree boosting is a very effective ensemble learning algorithm, which combines several weak classifiers into a strong classifier for better classification performance. Let \( D = \left\{ {\left( {x_{i} ,y_{i} } \right)} \right\}(\left| D \right| = n,x_{i} \in {\mathbb{R}}^{m} ,y_{i} \in {\mathbb{R}}) \) represent a dataset with \( n \) examples and \( m \) features. The output \( \hat{y}_{i} \) of a tree boosting model with \( K \) trees is defined as follows:

$$ \hat{y}_{i} = \sum\limits_{k = 1}^{K} {f_{k} (x_{i} )} ,f_{k} \in F $$
(1)

where \( F = \{ f(x) = \omega_{q(x)} \} (q:{\mathbb{R}}^{m} \to T,\omega \in {\mathbb{R}}^{T} ) \) is the space of regression or classification trees (also known as CART). Each \( f_{k} \) corresponds to an independent tree with structure part \( q \) and leaf weights part \( \omega \). Here \( T \) denotes the number of leaves in the tree.
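As a toy illustration of Eq. (1), the ensemble prediction for an example is simply the sum of the outputs of the \( K \) individual trees; the stub "trees" below are hypothetical callables returning leaf weights, not XGBoost's internal tree representation.

```python
# Toy illustration of Eq. (1): y_hat = sum over K tree outputs.
def ensemble_predict(trees, x):
    return sum(f(x) for f in trees)

trees = [lambda x: 0.3 if x[0] > 0.5 else -0.1,
         lambda x: 0.2 if x[1] > 0.0 else -0.2]
print(ensemble_predict(trees, [0.7, -1.0]))  # prints 0.3 + (-0.2), i.e. about 0.1
```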

The set of functions \( f_{k} \) in the tree model can be learned by minimizing the following objective function:

$$ O = \sum\limits_{i} {l(\hat{y}_{i} ,y_{i} )} + \sum\limits_{k} {\Upomega (f_{k} )} $$
(2)

The first term \( l \) in Eq. (2) is a training loss function which measures the distance between the prediction \( \hat{y}_{i} \) and the target \( y_{i} \). The second term \( \Upomega \) in Eq. (2) is the penalty on the complexity of the tree model.

The tree boosting model with the objective function in Eq. (2) cannot be optimized by traditional optimization methods in Euclidean space. Gradient tree boosting improves on tree boosting by training the model in an additive manner, so that the prediction at the \( t \)-th iteration is \( \hat{y}^{(t)} = \hat{y}^{(t - 1)} + f_{t} (x) \). The objective function at the \( t \)-th iteration then becomes:

$$ O^{(t)} = \sum\limits_{i = 1}^{n} {l(y_{i} ,\hat{y}_{i}^{(t - 1)} + f_{t} (x_{i} ))} + \Upomega (f_{t} ) $$
(3)

XGBoost approximates Eq. (3) using a second-order Taylor expansion, so the objective function at step \( t \) can be rewritten as:

$$ O^{(t)} \simeq \tilde{O}^{(t)} = \sum\limits_{i = 1}^{n} {[l(y_{i} ,\hat{y}_{i}^{(t - 1)} ) + g_{i} f_{t} (x_{i} ) + \frac{1}{2}h_{i} \,f_{t}^{2} (x_{i} )]} + {\Upomega}(f_{t} ) $$
(4)

where \( g_{i} \) and \( h_{i} \) are first and second order gradient statistics on the loss function, and \( {\Upomega} (f) = \gamma T + \frac{1}{2}\lambda \left\| \omega \right\|^{2} \) in XGBoost.
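For concreteness, the following minimal sketch computes \( g_{i} \) and \( h_{i} \) for the squared error loss \( l(y,\hat{y}) = (y - \hat{y})^{2} \); both the choice of loss and the numbers are assumptions for illustration only.

```python
# Worked example of the gradient statistics in Eq. (4) for squared error loss.
import numpy as np

y = np.array([1.0, 0.0, 1.0])        # targets (made-up values)
y_prev = np.array([0.6, 0.2, 0.9])   # predictions from iteration t-1

g = 2.0 * (y_prev - y)               # first-order statistics g_i
h = 2.0 * np.ones_like(y)            # second-order statistics h_i (constant for this loss)
```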

Denoting \( I_{j} = \{ i|q(x_{i} ) = j\} \) as the instance set of leaf \( j \), after removing the constant terms and expanding \( \Upomega \), Eq. (4) can be simplified as:

$$ \tilde{O}^{(t)} = \sum\limits_{j = 1}^{T} {[(\sum\limits_{{i \in I_{j} }} {g_{i} } )\omega_{j} + \frac{1}{2}(\sum\limits_{{i \in I_{j} }} {h_{i} } + \lambda )\omega_{j}^{2} ] + \gamma T} $$
(5)

The solution weight \( \omega_{j}^{*} \) of leaf \( j \) for a fixed tree structure \( q(x) \) can be obtained by applying the following equation:

$$ \omega_{j}^{*} = - \frac{{\sum\nolimits_{{i \in I_{j} }} {g_{i} } }}{{\sum\nolimits_{{i \in I_{j} }} {h_{i} + \lambda } }} $$
(6)

Substituting \( \omega_{j}^{*} \) into Eq. (5) yields:

$$ \tilde{O}(q) = - \frac{1}{2}\sum\limits_{j = 1}^{T} {\frac{{(\sum\nolimits_{{i \in I_{j} }} {g_{i} } )^{2} }}{{\sum\nolimits_{{i \in I_{j} }} {h_{i} + \lambda } }}} + \gamma T $$
(7)

Eq. (7) can be used as a scoring function to evaluate a tree structure \( q(x) \) and to find the optimal tree structures for classification. However, it is impossible to enumerate all possible tree structures \( q \) in practice. Reference [13] describes a greedy algorithm that starts from a single leaf and iteratively adds branches to grow the tree. Whether to add a split to the existing tree structure is decided by the following gain function:

$$ O_{split} = \frac{1}{2}\left[ {\frac{{(\sum\nolimits_{{i \in I_{L} }} {g_{i} } )^{2} }}{{\sum\nolimits_{{i \in I_{L} }} {h_{i} + \lambda } }} + \frac{{(\sum\nolimits_{{i \in I_{R} }} {g_{i} } )^{2} }}{{\sum\nolimits_{{i \in I_{R} }} {h_{i} + \lambda } }} - \frac{{(\sum\nolimits_{i \in I} {g_{i} } )^{2} }}{{\sum\nolimits_{i \in I} {h_{i} + \lambda } }}} \right] - \gamma $$
(8)

where \( I_{L} \) and \( I_{R} \) are the instance sets of left and right nodes after the split and \( I = I_{L} \cup I_{R} \).
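The quantities in Eqs. (6) and (8) depend only on sums of the gradient statistics over the relevant instance sets. The sketch below is an illustrative implementation, not XGBoost's optimized internal code; lam and gamma stand in for \( \lambda \) and \( \gamma \), and g/h are arrays of per-instance statistics.

```python
# Illustrative implementation of Eqs. (6) and (8).
import numpy as np

def leaf_weight(g, h, lam):
    # Eq. (6): optimal leaf weight, -sum(g) / (sum(h) + lambda)
    return -np.sum(g) / (np.sum(h) + lam)

def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    # Eq. (8): gain of splitting instance set I into I_L and I_R
    def score(g, h):
        return np.sum(g) ** 2 / (np.sum(h) + lam)
    g_all = np.concatenate([g_left, g_right])
    h_all = np.concatenate([h_left, h_right])
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_all, h_all)) - gamma
```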

XGBoost is thus a fast implementation of the GB algorithm, combining high speed with high accuracy. In this paper, the XGBoost classifier is added to the top level of the CNN to produce the image classification results.

2.3 CNN-XGBoost Model

In this paper, the specific structure of the CNN-XGBoost model for image classification is shown in Fig. 2. First, the input image data is normalized and fed to the input layer of the CNN. After the CNN has been trained with the BP algorithm for several epochs to obtain a proper structure for image classification, XGBoost replaces the soft-max output layer of the CNN and is trained on the features the CNN extracts. Finally, the CNN-XGBoost model produces the classification results for the test images. Combining these two outstanding classifiers, the CNN-XGBoost model automatically obtains features from the input and provides more precise classification results; a code sketch of the pipeline follows Fig. 2.

Fig. 2. The specific structure of CNN-XGBoost model for image classification
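The pipeline in Fig. 2 could be sketched as follows: train the CNN with back-propagation, drop its soft-max output layer, use the remaining network as a feature extractor, and fit an XGBoost classifier on the extracted features. The helper build_cnn refers to the earlier sketch, and the XGBoost hyperparameters shown are placeholders within the ranges of Sect. 3.2, not the authors' final settings.

```python
# Sketch of the two-stage CNN-XGBoost pipeline in Fig. 2.
import xgboost as xgb
from tensorflow.keras import models

def train_cnn_xgboost(x_train, y_train, x_test):
    # Stage 1: train the CNN with back-propagation (batch 128, 100 epochs).
    cnn = build_cnn(input_shape=x_train.shape[1:], n_classes=10)
    cnn.fit(x_train, y_train, batch_size=128, epochs=100, verbose=0)

    # Stage 2: remove the soft-max output layer and use the last hidden
    # (fully connected) layer as a trainable feature extractor.
    extractor = models.Model(inputs=cnn.inputs,
                             outputs=cnn.layers[-2].output)
    train_feats = extractor.predict(x_train)
    test_feats = extractor.predict(x_test)

    # Fit XGBoost on the extracted features and predict the test labels.
    clf = xgb.XGBClassifier(n_estimators=300, max_depth=8, learning_rate=0.05)
    clf.fit(train_feats, y_train)
    return clf.predict(test_feats)
```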

3 Experimental Results

In order to verify the improvement and validity of the above method, we compare it with classical methods through experiments on different databases. We also compare the classification results of different methods on the same databases to evaluate the effectiveness of the CNN-XGBoost model. The databases, parameter settings and classification results are described below.

3.1 Database

The two databases selected in this paper are commonly used in image classification problems: the MNIST handwritten digit database and the CIFAR-10 color image database. Both databases are universal benchmarks, which makes it convenient to compare with other methods.

The MNIST handwritten digit database is a subset of the NIST dataset and is composed of the SD-1 and SD-3 datasets. It contains 60,000 training images and 10,000 test images, all of which are grayscale images of the handwritten digits 0-9.

The 60,000 training samples are handwritten digits from approximately 250 individuals; part of the dataset is displayed in Fig. 3.

Fig. 3. Part of MNIST database

The CIFAR-10 database contains 60,000 color images in 10 categories. It is divided into five training batches and one test batch, each containing 10,000 images. Figure 4 shows a random selection of images from the database.

Fig. 4. Random selection from the CIFAR-10 database
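As a brief illustration, the two databases can be loaded and normalized as follows; the keras.datasets loaders are used here for convenience and are not necessarily what the authors used.

```python
# Loading and normalizing the two databases (illustrative only).
from tensorflow.keras.datasets import mnist, cifar10

(x_tr_m, y_tr_m), (x_te_m, y_te_m) = mnist.load_data()     # 60,000 / 10,000 images, 28x28 grayscale
(x_tr_c, y_tr_c), (x_te_c, y_te_c) = cifar10.load_data()   # 50,000 / 10,000 images, 32x32 color

# Scale pixel values to [0, 1]; MNIST needs an explicit channel axis.
x_tr_m = (x_tr_m / 255.0)[..., None]
x_te_m = (x_te_m / 255.0)[..., None]
x_tr_c = x_tr_c / 255.0
x_te_c = x_te_c / 255.0
```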

3.2 Parameter Settings

To determine the parameters of the PCA-initialized CNN for image classification, we use 5 training iterations as the standard, calculate the average classification accuracy over 5 runs for various numbers of convolution kernels in the two C layers on the two databases, and select the numbers of feature maps corresponding to the maximum accuracy as the parameters.

Considering the complexity and run-time of the network structure, we tune the number of feature maps in the first C layer from 3 to 12 with a step size of 1, and the number in the second C layer from 3 to 21 with a step size of 3 for the MNIST database. Similarly, based on experience, we tune the number of feature maps in the first C layer from 22 to 40 with a step size of 2, and the number in the second C layer from 24 to 72 with a step size of 8 for the CIFAR-10 database. We then calculate the CNN classification accuracy for each combination of feature map numbers; a sketch of this search is given after Figs. 5 and 6. As shown in Figs. 5 and 6, the numbers of feature maps are chosen as 10 and 21 for the MNIST database, and 38 and 72 for the CIFAR-10 database.

Fig. 5. Feature map number selection on MNIST database

Fig. 6. Feature map number selection on CIFAR-10 database
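The feature-map selection described above can be sketched as a simple grid search: for each candidate pair of C1/C2 feature-map counts, train the CNN briefly several times and keep the pair with the best average accuracy. Here build_cnn refers to the earlier sketch, the candidate ranges follow the MNIST settings in the text, and the run counts and evaluation details are assumptions for illustration.

```python
# Sketch of the feature-map grid search (MNIST ranges from the text).
import numpy as np

def select_feature_maps(x_train, y_train, x_val, y_val,
                        c1_range=range(3, 13), c2_range=range(3, 22, 3),
                        runs=5, epochs=5):
    best, best_acc = None, 0.0
    for c1 in c1_range:
        for c2 in c2_range:
            accs = []
            for _ in range(runs):
                cnn = build_cnn(c1_maps=c1, c2_maps=c2)
                cnn.fit(x_train, y_train, batch_size=128,
                        epochs=epochs, verbose=0)
                accs.append(cnn.evaluate(x_val, y_val, verbose=0)[1])
            if np.mean(accs) > best_acc:
                best, best_acc = (c1, c2), float(np.mean(accs))
    return best
```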

Taking into account that the image sizes of the MNIST and CIFAR-10 datasets are \( 28 \times 28 \) and \( 32 \times 32 \) respectively, the sizes of the convolution kernels and the sampling area are set, following LeNet-5 [11], to \( 5 \times 5 \) and \( 2 \times 2 \).

When training the CNN on the two databases, we use a batch of 128 images and treat one pass over all images as an iteration, recording the classification accuracy over 100 iterations after determining the feature map numbers. The classification accuracy rates on the training and test sets of the two databases are displayed in Figs. 7 and 8 respectively.

Fig. 7. Classification accuracy rates on the MNIST database

Fig. 8. Classification accuracy rates on the CIFAR-10 database

The parameters of the XGBoost classifier are determined by testing the final classification accuracy under different numbers of iterations, maximum tree depths and shrinkage steps. In this paper, we enumerate five iteration values (100, 200, 300, 400, 500), five maximum tree depths (4, 6, 8, 10, 12) and five shrinkage values (0.005, 0.01, 0.05, 0.1, 0.2), and the final details of the CNN-XGBoost model parameter settings are described in Table 1; a sketch of this enumeration follows the table.

Table 1. Parameters in CNN-XGBoost model
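The parameter enumeration above could be sketched as follows. The grids come from the text, while the use of test-set accuracy for selection is an assumption; train_feats and test_feats denote the CNN features produced by the extractor of Sect. 2.3.

```python
# Sketch of the XGBoost hyperparameter enumeration in Sect. 3.2.
import itertools
import xgboost as xgb
from sklearn.metrics import accuracy_score

def search_xgb_params(train_feats, y_train, test_feats, y_test):
    grid = itertools.product([100, 200, 300, 400, 500],      # boosting rounds
                             [4, 6, 8, 10, 12],              # max tree depth
                             [0.005, 0.01, 0.05, 0.1, 0.2])  # shrinkage step
    best, best_acc = None, 0.0
    for n_est, depth, eta in grid:
        clf = xgb.XGBClassifier(n_estimators=n_est, max_depth=depth,
                                learning_rate=eta)
        clf.fit(train_feats, y_train)
        acc = accuracy_score(y_test, clf.predict(test_feats))
        if acc > best_acc:
            best, best_acc = (n_est, depth, eta), acc
    return best, best_acc
```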

3.3 Results and Analysis

To test the effectiveness and practicality of the proposed model for image classification, we evaluate it on the two databases described above. In addition, we compare its classification accuracy with several different intelligent classification models on the same databases, taking 100 iterations as a uniform standard for the trainable models. SAE is the abbreviation for Stacked Auto-Encoder. Table 2 lists the exact accuracy rates of these models. Evidently, the model proposed in this paper achieves higher classification accuracy on both databases than the others, which shows that it can indeed improve classification performance. The reason probably lies in the fact that CNN automatically extracts high-quality features with little loss of image information, while the fast XGBoost performs efficient classification. These observations demonstrate the effectiveness and practicality of the proposed image classification method with the CNN-XGBoost model. If the number of iterations of the CNN optimization process is increased and the hardware conditions are improved, we believe even higher classification accuracy can be obtained.

Table 2. Classification accuracy rates of several different models on the two databases

4 Conclusion

In order to further enhance the classification performance of a typical Deep Learning framework, CNN, which performs outstandingly on the image classification problem, this paper proposes a novel image classification method with a CNN-XGBoost model. By combining the CNN and XGBoost classifiers, the model provides more precise output, integrating CNN as a trainable feature extractor that automatically obtains features from the input and XGBoost as a recognizer at the top level of the network to produce results. Experiments are conducted on the well-known MNIST and CIFAR-10 databases to examine the performance, and the results demonstrate that the new method performs better than other methods on the same databases, which verifies the effectiveness and practicality of the proposed method for the image classification problem.

Future work will focus on adjusting the CNN structure to extract higher quality features and on speeding up the convergence of the cost function by changing the optimization techniques, so as to further improve the classification results and the training effect.