1 Introduction

In recent decades, ensemble learning has attracted considerable attention in machine learning because it can effectively solve practical application problems. Dasarathy and Sheela (1979) first proposed the idea of ensemble learning. Ensemble learning is a machine learning method that trains a series of base learners and integrates their individual results according to a combination rule, achieving better outcomes than a single learner. Since most ensemble learning algorithms place no restrictions on the type of base learner and fit well into many mature machine learning frameworks, ensemble learning is widely used in various fields. Over time, more ensemble learning algorithms have been proposed, and major breakthroughs have been made in many fields. However, existing ensemble learning algorithms also have limitations. For example, the base learner needs high sensitivity and efficient learning ability; otherwise, the ensemble easily overfits and incurs additional time and computational overhead.

With the growth of data, the improvement in computing power and major algorithmic innovations, deep learning has begun to rise. Geoffrey Hinton, a professor at the University of Toronto and a leading figure in machine learning, together with his student Ruslan Salakhutdinov (2006), published an article in Science that set off the wave of deep learning in academia and industry. Deep learning, especially the convolutional neural network, not only has excellent feature learning capability but also overcomes training difficulties through layer-by-layer initialization. The convolutional neural network has therefore become one of the most intensively studied deep models; it is widely used in fields such as computer vision (Krizhevsky et al. 2012) and speech recognition (Hinton et al. 2012), where it has achieved state-of-the-art performance.

To improve classification accuracy and efficiency, this paper proposes an ensemble learning framework for convolutional neural networks based on multiple classifiers. The work consists of two parts: (1) construct an ensemble learning framework based on different base classifiers and compare its accuracy with that of a single classifier; and (2) extract features from the MNIST data set using a CNN and then classify the extracted features with the ensemble learning framework.

This article is organized into six sections: Sect. 1 introduces the relevant background, purpose, methods and organizational structure; Sect. 2 reviews related work from recent years; Sect. 3 presents the basic theory of convolutional neural networks and ensemble learning; Sect. 4 introduces the overall framework and describes the whole process; Sect. 5 is the experimental part, covering the required data sets, experimental details, results and analysis; Sect. 6 concludes the paper.

2 Related work

Ensemble learning is a machine learning paradigm that uses multiple learners to solve the same problem. Because it can significantly improve the generalization ability of a learning system, it is widely used in various fields, such as robot assistance (Adama et al. 2018), Web security detection (Zhou and Wang 2019) and online streaming data anomaly detection (Ding et al. 2017). As research has deepened, many algorithms for optimizing ensemble learning have been proposed: Wang et al. (2016) proposed a multiple-kernel ensemble learning algorithm that makes full use of historical data to improve software defect prediction; Zhang et al. (2017) proposed a local hierarchical ensemble framework for hierarchical multi-label classification, addressing the fact that conventional classification algorithms ignore the structural information between classes; and the hybrid incremental ensemble learning (HIEL) approach proposed by Zhiwen et al. (2019) considers the feature space and the sample space simultaneously to handle noisy data sets. Although ensemble learning performs better than a single classifier, as data grow rapidly and become more diverse in type, it suffers from problems such as high computational complexity and low efficiency.

Deep convolutional neural networks (DCNNs) have become one of the most studied deep neural network models with the continuous development of deep learning algorithms. Convolutional neural networks have demonstrated powerful capabilities in many fields. In computer vision, Mask R-CNN, proposed by He et al. (2017), performs object detection by adding a target-mask branch in parallel with the existing branch; when its segmentation output is changed to one-hot keypoint masks, it can also be used for human pose estimation. In natural language processing, WaveNet (van den Oord et al. 2016) uses a convolutional generative model to output the conditional probability of audio samples and synthesize speech; Zhang et al. (2018) proposed a deep convolutional neural network (DCNN) model with a linear support vector machine for speech recognition. In addition, Ji et al. (2019) proposed a Siamese fully convolutional network for extracting buildings from satellite remote sensing images.

As can be seen, the existing methods combine a convolutional neural network with a single classifier; algorithms that fuse ensemble learning are lacking. Therefore, this paper proposes an ensemble learning framework for convolutional neural networks to deal with multi-classification problems.

3 Basic theory

3.1 Convolutional neural network

A convolutional neural network (CNN) is a feedforward neural network that includes alternating convolutional layers and pooling layers. The structure of a CNN is shown in Fig. 1.

Fig. 1 Structure of CNN

The difference between a convolutional neural network and an ordinary neural network is that a convolutional neural network contains a feature extractor composed of convolutional layers and subsampling layers. In a convolutional layer, each neuron is connected only to a portion of the neurons in the adjacent layer. After the convolution operation, several feature maps are generated, and the neurons within the same feature map share weights, i.e., they share a convolution kernel. The immediate benefit of shared weights is a reduction in the number of connections between layers of the network, which also reduces the risk of overfitting. Subsampling, also called pooling, usually takes two forms: mean pooling and max pooling. Subsampling can be seen as a special convolution process. Together, convolution and subsampling greatly simplify the model and reduce its parameters.

CNNs have great advantages in feature extraction. First, because of the nature of convolution and pooling, the extracted features are less prone to overfitting. Second, the features learned by a CNN are more informative than simple hand-crafted descriptors such as projection, direction and center of gravity. Third, the fit of the overall model can be controlled through the sizes of the convolution, pooling and final output feature vectors: the dimension of the feature vector can be reduced when the model overfits, and the output dimension of the convolutional layers can be increased when the model underfits.
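As a brief illustration of weight sharing (a hedged sketch with assumed layer sizes, not the network used later in this paper), the following Keras snippet builds a single 5 × 5 convolutional layer followed by 2 × 2 max pooling; the convolutional layer needs only 156 parameters regardless of the spatial extent of the input, and the pooling layer adds none.

```python
# Minimal sketch: weight sharing keeps the parameter count small.
# Layer sizes here are assumptions for illustration only.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(6, kernel_size=5, activation="relu"),  # 6 * (5*5*1) + 6 = 156 shared weights
    layers.MaxPooling2D(pool_size=2),                    # pooling has no trainable parameters
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])
model.summary()  # a fully connected layer on the raw 784 pixels would need far more weights per unit
```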

3.2 Ensemble learning

Ensemble learning accomplishes a learning task by building and combining multiple classifiers. Specifically, N classifiers with independent decision-making capabilities are combined according to a certain strategy to make a joint decision. Classifiers are also known as learners. The idea of ensemble learning is shown in Fig. 2.

Fig. 2 General idea of ensemble learning

At present, ensemble learning methods can be divided into two categories according to whether there are dependencies between the individual learners. In the first category, there is no strong dependency between individual learners, so a series of individual learners can be generated in parallel; the representative algorithms are the Bagging (Breiman 1996) family. In the second category, there is a strong dependency between individual learners, so the individual learners must essentially be generated serially; the representative algorithms are the Boosting (Schapire 1989) family.

3.3 Bagging

Bagging (Breiman 1996) improves the accuracy of a learning algorithm by training predictors on randomly generated training sets and combining the resulting prediction functions into a single predictor by some strategy. The classification method required by Bagging should be unstable: unstable means that a very small change in the selected training set can cause a large change in the final classification result.

Random forest (Breiman 2001) is a specialized and advanced version of Bagging. It is specialized because the base learners of a random forest are decision trees; it is advanced because, on top of the random sampling of Bagging, it adds random selection of features. The basic idea still falls within the Bagging category.
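As a rough scikit-learn sketch of this relationship (illustrative settings, not the configuration used in the experiments below), plain Bagging over decision trees uses bootstrap samples only, while a random forest additionally restricts each split to a random subset of features:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Bagging: bootstrap resampling of the training set, full feature set at every split.
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)

# Random forest: bootstrap resampling plus random feature selection at each split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)

print("bagged trees :", cross_val_score(bagged_trees, X, y, cv=5).mean())
print("random forest:", cross_val_score(forest, X, y, cv=5).mean())
```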

3.4 Boosting

Boosting (Schapire 1989) is an algorithm that combines base learning algorithms into a strong learning algorithm. It improves the accuracy of any given learning algorithm by constructing a number of prediction functions and combining them according to matching strategies into a final prediction function. However, Boosting has a major drawback: the model needs to know in advance the lower bound on the accuracy of the base classifiers. In response to these problems, Adaboost (Freund and Schapire 1995) was proposed.

XGBoost (Chen and Guestrin 2016) is short for extreme gradient boosting. It is a machine learning library focused on the gradient boosting algorithm and has gained wide attention for its excellent learning performance and efficient training speed. XGBoost implements a generic tree boosting algorithm, and the tree model used is usually CART. The idea of XGBoost is to keep adding trees and continuously perform feature splitting to grow each tree.
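A minimal usage sketch of the xgboost library (illustrative defaults, not the settings used in Sect. 5) shows the typical fit/predict workflow in which trees are added one after another:

```python
from xgboost import XGBClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Trees are added one by one; each new tree fits the gradient of the loss
# left by the current ensemble, with greedy feature splits inside each tree.
model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_tr, y_tr)
print(accuracy_score(y_te, model.predict(X_te)))
```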

4 Framework description

4.1 Overview

This paper mainly contains the following two parts: (1) ensemble learning framework: multiple classifiers are used as base classifiers of Bagging and Boosting for classification; and (2) CNN + ensemble learning framework: features are first extracted using a convolutional neural network, and then the ensemble learning algorithms in (1) are used for classification.

4.2 Ensemble learning framework

There are two types of ensemble learning frameworks: one based on Bagging and one based on Boosting.

4.3 Framework based on Bagging

The main steps of the Bagging-based framework are as follows:

  1. Train T times on the training set S = {(x1, y1), (x2, y2),…, (xm, ym)}: in each round, the bootstrap method samples from S with replacement, so only a subset St of S is used as the current training set. After T rounds, T different classifiers Ct are obtained;

  2. Combine the base classifiers obtained from each round into the final classifier: when classifying a test sample, the T classifiers are called respectively, the T classification results are counted, and the class with the most votes is taken as the final label.

The framework of Bagging based on SVM (Cortes and Vapnik 1995) as the base classifier C is shown in Fig. 3.

Fig. 3 Frame of the Bagging based on SVM

Among them, the base classifier SVM can be replaced by decision tree (Breiman et al. 1984), MLP (Longstaff and Cross 1987), Naive Bayesian (Lewis 1998), KNN (Cover and Hart 1967) and Perceptron (Rosenblatt 1958). The algorithm of Bagging is shown in Algorithm 1.

Algorithm 1 Bagging
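Since Algorithm 1 is only reproduced as a figure in the original, the following Python sketch approximates its steps from the description above (bootstrap resampling, T base classifiers, majority voting); the SVC base learner and the value of T are placeholders:

```python
import numpy as np
from sklearn.base import clone
from sklearn.svm import SVC

def bagging_train(X, y, T=25, base=SVC()):
    """Step 1: draw T bootstrap samples of S and train one base classifier on each."""
    m = X.shape[0]
    rng = np.random.default_rng(0)
    classifiers = []
    for _ in range(T):
        idx = rng.integers(0, m, size=m)          # sampling with replacement (bootstrap)
        classifiers.append(clone(base).fit(X[idx], y[idx]))
    return classifiers

def bagging_predict(classifiers, X):
    """Step 2: call the T classifiers and return the majority vote per sample."""
    votes = np.stack([clf.predict(X) for clf in classifiers]).astype(int)  # assumes integer labels
    return np.array([np.bincount(col).argmax() for col in votes.T])
```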

4.4 Framework based on Boosting

The Boosting framework here is based on the Adaboost algorithm. The core of the algorithm is updating the weight distribution D: if a sample has been classified accurately, its weight is reduced when constructing the next training set; conversely, if a sample is classified inaccurately, its weight is increased. At the same time, the utterance right (voting weight) of each base classifier is computed. The re-weighted sample set is then used to train the next classifier. The main steps of the Boosting-based framework are as follows:

  1. Initialize the weight distribution D1 of the training data S = {(x1, y1), (x2, y2),…, (xm, ym)}: if there are m samples, each sample point is given the same weight 1/m at the beginning

    $$ D_{1} = \left( w_{1,1}, w_{1,2}, \ldots, w_{1,m} \right), \quad w_{1,i} = \frac{1}{m}, \quad i = 1,2, \ldots, m $$
    (1)

    where w1,i represents the weight of the ith sample of the first iteration.

  2. Perform T rounds of training on the weighted training set S: in each iteration, the base classifier Ct is trained on S under the weight distribution Dt. According to the classification result, the weighted error et and the current classifier's utterance right at are calculated:

    $$ e_{t} = \sum\limits_{i = 1}^{m} w_{t,i}\, I\left( C_{t}(x_{i}) \ne y_{i} \right) $$
    (2)
    $$ a_{t} = \frac{1}{2}\log \frac{1 - e_{t}}{e_{t}} $$
    (3)

    where Ct(xi) represents the category predicted by Ct for the ith sample in the tth iteration; I(·) is the indicator function, equal to 1 when the predicted category differs from the real category and 0 otherwise; and the utterance right at indicates the importance of Ct(x) in the final classifier.

    The weight distribution Dt+1 used for the next iteration is updated according to et and at:

    $$ D_{t + 1} = \left( w_{t + 1,1}, w_{t + 1,2}, \ldots, w_{t + 1,m} \right) $$
    (4)
    $$ w_{t + 1,i} = \frac{w_{t,i}}{Z_{t}}\exp \left( - a_{t} y_{i} C_{t}(x_{i}) \right), \quad i = 1,2, \ldots, m $$
    (5)
    $$ Z_{t} = \sum\limits_{i = 1}^{m} w_{t,i} \exp \left( - a_{t} y_{i} C_{t}(x_{i}) \right) $$
    (6)

    where wt,i is the weight of the ith sample at the tth iteration; yi is the real class of the ith sample; and Zt is the normalization factor that makes the weights of all samples sum to 1.

  3. Combine the base classifiers obtained from each round into a strong final classifier: after training, a base classifier with a small classification error rate has a large utterance right and plays a greater role in the final classification function, while a base classifier with a large classification error rate has a small utterance right and plays a minor role. In other words, a base classifier with a low error rate accounts for a larger proportion of the final classifier, and vice versa, as shown in Eq. (7):

    $$ C(x) = \operatorname{sign}\left( \sum\limits_{t = 1}^{T} a_{t} C_{t}(x) \right) $$
    (7)

    where sign(·) is the sign function and x is the sample to be predicted.

The framework of Boosting based on SVM as the base classifier C is shown in Fig. 4.

Fig. 4 Frame of the Boosting based on SVM

Among them, the base classifier SVM can be replaced by decision tree, Naive Bayes and Perceptron, because the base classifier of the Boosting algorithm must support sample weights. The algorithm of Boosting is shown in Algorithm 2.

Algorithm 2 Boosting (Adaboost)
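As with Algorithm 1, the pseudocode appears only as a figure in the original; the sketch below follows Eqs. (1)–(7) for the binary case with labels in {-1, +1} (a hedged approximation, with an SVC base learner and the number of rounds T chosen arbitrarily):

```python
import numpy as np
from sklearn.base import clone
from sklearn.svm import SVC

def adaboost_train(X, y, T=10, base=SVC(kernel="rbf")):
    """Minimal AdaBoost over Eqs. (1)-(6); y must take values in {-1, +1}."""
    m = X.shape[0]
    w = np.full(m, 1.0 / m)                    # Eq. (1): uniform initial weights
    classifiers, alphas = [], []
    for _ in range(T):
        clf = clone(base).fit(X, y, sample_weight=w)
        pred = clf.predict(X)
        e = float(np.sum(w * (pred != y)))     # Eq. (2): weighted error
        if e == 0.0:                           # perfect round: keep it and stop
            classifiers.append(clf)
            alphas.append(1.0)
            break
        if e >= 0.5:                           # no better than chance: stop
            break
        a = 0.5 * np.log((1 - e) / e)          # Eq. (3): utterance right a_t
        w = w * np.exp(-a * y * pred)          # Eq. (5): re-weight the samples
        w /= w.sum()                           # Eq. (6): divide by the normalization factor Z_t
        classifiers.append(clf)
        alphas.append(a)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    """Eq. (7): sign of the a_t-weighted vote of the base classifiers."""
    agg = sum(a * clf.predict(X) for a, clf in zip(alphas, classifiers))
    return np.sign(agg)
```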

4.5 CNN + ensemble learning framework

A CNN has excellent feature extraction ability, so in this framework the CNN is first used to extract features from the image, and the extracted features are then classified using the ensemble learning framework. The framework is shown in Fig. 5.

Fig. 5 Frame of the CNN + ensemble learning

Figure 5 shows the overall CNN + ensemble learning framework. After features are extracted with the CNN, an ensemble learning framework is attached behind the F6 layer for classification. In the CNN, all convolution kernels are 5 × 5 with stride 1, and the pooling layers use 2 × 2 max pooling. The details are shown in Table 1.

Table 1 Structure and parameters of the CNN
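Because Table 1 is not reproduced here, the sketch below assumes a LeNet-5-style layout consistent with the description above (5 × 5 kernels, stride 1, 2 × 2 max pooling, a dense F6 layer); the channel counts, training epochs and the Bagging-SVM head are illustrative placeholders rather than the authors' exact configuration:

```python
from tensorflow.keras import layers, models, datasets
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

(x_tr, y_tr), (x_te, y_te) = datasets.mnist.load_data()
x_tr = x_tr[..., None].astype("float32") / 255.0
x_te = x_te[..., None].astype("float32") / 255.0

inputs = layers.Input(shape=(28, 28, 1))
x = layers.Conv2D(6, 5, strides=1, padding="same", activation="relu")(inputs)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(16, 5, strides=1, activation="relu")(x)
x = layers.MaxPooling2D(2)(x)
x = layers.Flatten()(x)
x = layers.Dense(120, activation="relu")(x)
f6 = layers.Dense(84, activation="relu", name="F6")(x)   # feature layer fed to the ensemble
outputs = layers.Dense(10, activation="softmax")(f6)

cnn = models.Model(inputs, outputs)
cnn.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
cnn.fit(x_tr, y_tr, epochs=2, batch_size=128, validation_split=0.1)

# Cut the network at F6 and use its activations as features for the ensemble head.
extractor = models.Model(inputs, f6)
feat_tr = extractor.predict(x_tr, batch_size=256)
feat_te = extractor.predict(x_te, batch_size=256)

# Bagging of SVMs on the extracted features (a subset keeps this sketch fast).
ensemble = BaggingClassifier(SVC(), n_estimators=10)
ensemble.fit(feat_tr[:10000], y_tr[:10000])
print("test accuracy:", ensemble.score(feat_te, y_te))
```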

5 Experimental results and analysis

5.1 Data set

This section describes the UCI data sets Iris and Wine quality-red and the handwritten digit recognition data set MNIST. Iris contains 150 samples in 3 categories, one sample per row of the data set; the classes of Wine quality-red are ordered and unbalanced; MNIST consists of 60,000 training images and 10,000 test images in 10 categories. Each image is 28 × 28 pixels in gray scale, with pixel values represented as floating point numbers in [0, 1]; the darker the pixel, the closer the value is to 1. Table 2 gives a detailed description of the three data sets.

Table 2 Details of the three data sets

5.2 Experimental parameters

Since the experiments are performed on three data sets, base classifiers with the same parameters are used on the same data set so that the results are comparable. Across the three data sets, the following base classifiers use identical parameters: the decision tree uses entropy to select features; the MLP has 2 hidden layers with 100 neurons in the first layer and 50 in the second, is trained with stochastic gradient descent, and uses the ReLU activation function; K-Nearest Neighbors uses k = 5; the random forest uses bootstrap samples, selects features based on Gini impurity, and uses out-of-bag samples to estimate generalization accuracy; XGBoost uses tree boosters as its base learners. The following base classifiers use data set-specific parameters: for Iris, the kernel function of the SVM is the radial basis function and Naive Bayes uses a multinomial (polynomial) model; for MNIST, the kernel function of the SVM is polynomial and Naive Bayes uses a Gaussian model; for Wine quality-red, the kernel function of the SVM is sigmoid and Naive Bayes uses a multinomial (polynomial) model.
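For reference, one possible scikit-learn/xgboost mapping of these settings is sketched below; values not stated in the text (such as the number of trees) are left at library defaults, and the multinomial/Gaussian Naive Bayes choice is an interpretation of the "polynomial"/"Gaussian" wording above:

```python
# Hedged mapping of the stated hyper-parameters onto concrete estimators.
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from xgboost import XGBClassifier

shared = {
    "decision_tree": DecisionTreeClassifier(criterion="entropy"),
    "mlp": MLPClassifier(hidden_layer_sizes=(100, 50), solver="sgd", activation="relu"),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "random_forest": RandomForestClassifier(bootstrap=True, criterion="gini", oob_score=True),
    "xgboost": XGBClassifier(booster="gbtree"),
}

per_dataset_svm = {
    "iris": SVC(kernel="rbf"),
    "mnist": SVC(kernel="poly"),
    "wine": SVC(kernel="sigmoid"),
}

per_dataset_nb = {
    "iris": MultinomialNB(),   # "polynomial"/multinomial model in the text
    "mnist": GaussianNB(),
    "wine": MultinomialNB(),
}
```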

5.3 Evaluation

The model is evaluated using a confusion matrix, which consists of four quantities: TP (true positives), the number of positive examples recognized as positive; FP (false positives), the number of negative examples recognized as positive; FN (false negatives), the number of positive examples recognized as negative; and TN (true negatives), the number of negative examples recognized as negative. From these quantities, four evaluation indicators are computed in this experiment: accuracy, precision, recall and F1 score, as shown in Eqs. (8)–(11).

$$ {\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{TN}} + {\text{FN}} + {\text{FP}}}} $$
(8)
$$ {\text{Precision}} = \frac{\text{TP}}{{{\text{TP}} + {\text{FP}}}} $$
(9)
$$ {\text{Recall}} = \frac{\text{TP}}{{{\text{TP}} + {\text{FN}}}} $$
(10)
$$ F1{\text{ score}} = \frac{{2 \times {\text{Precision}} \times {\text{Recall}}}}{{{\text{Precision}} + {\text{Recall}}}} $$
(11)

5.4 An ensemble learning framework based on different base classifiers

Ensemble learning combines multiple base classifiers to obtain a more comprehensive and powerful classifier; this section therefore reports the following experiments on the Iris, Wine quality-red and MNIST data sets. First, a single classifier is used for classification. Then classification is performed using Bagging and Boosting based on different base classifiers; the accuracy (%) is shown in Tables 3, 4 and 6. Finally, the accuracy (%) of random forest and XGBoost is given in Tables 5 and 7, and Fig. 6 plots accuracy against the number of base classifiers.

Table 3 Accuracy of single classifier
Table 4 Accuracy of Bagging
Table 5 Accuracy of random forest
Fig. 6 Relationship between accuracy and the number of base classifiers

Table 3 shows that, for the small data set Iris, the accuracy of all single classifiers exceeds 90%, and KNN has the highest accuracy of 96.7%; for the unbalanced data set Wine quality-red, the accuracy is relatively low, mostly between 40 and 50%, with the decision tree reaching the highest accuracy of 59.2%; for the largest, image-based data set MNIST, the accuracy exceeds 80%, and SVM has the highest accuracy of 97.1%.

Combining Tables 4 and 5, it can be seen that for Iris and MNIST the accuracy of most algorithms exceeds 90%, a few percentage points higher than in Table 3; KNN and SVM again have the highest accuracies, 97% and 98.5%, respectively. For the unbalanced Wine quality-red, the accuracy is mostly 50–60%, almost 10 percentage points higher than with a single classifier; although Bagging with decision trees is below the best random forest result, it still reaches 68.7%. In addition, random forest achieves the highest accuracy, 70%, on the unbalanced data set Wine quality-red, and its performance on Iris and MNIST is also strong. Therefore, random forest has better robustness.

Combining Tables 6 and 7, it can be concluded that for Iris the accuracy of most Boosting variants exceeds 90%, a few percentage points higher than in Table 3, with the exception of Boosting based on Perceptron; for Wine quality-red, most algorithms exceed 50%, while Boosting based on Perceptron has a lower accuracy of 43.8%; for MNIST, the accuracy exceeds 80%, and SVM has the highest accuracy of 97.8%. Compared with the other Boosting methods, XGBoost achieves the best results on the Wine quality-red data set and is as robust as random forest.

Table 6 Accuracy of Boosting
Table 7 Accuracy of XGBoost

In Fig. 6, the first column plots accuracy against the number of base classifiers under the Bagging framework for the three data sets; the second column plots the same relationship under the Boosting framework. The number of base classifiers is increased in steps of 50. It can be seen that for the small data set Iris, only a few classifiers are needed to reach the highest classification accuracy; for the unbalanced data set Wine quality-red, the Boosting framework converges faster than the Bagging framework, and random forest and XGBoost show better robustness; for MNIST, the Boosting framework also converges faster than the Bagging framework, and only a few classifiers are needed to achieve good accuracy.

In summary, both the Bagging framework and the Boosting framework based on different base classifiers improve accuracy relative to a single classifier. The accuracy of the Boosting framework is slightly lower than that of the Bagging framework on the whole, but it converges faster. Random forest and XGBoost perform well on all data sets, especially on the unbalanced data set Wine quality-red, which shows that random forest and XGBoost have better robustness.

5.5 An ensemble learning framework for CNN based on different base classifiers

If the extracted features are not comprehensive, no amount of tuning of the classifier's parameters will reach an ideal result; conversely, if the extracted features are good, high accuracy is easy to obtain. Therefore, in this experiment, a CNN with superior feature extraction ability is first used to extract features from the MNIST image data set, and then the ensemble learning algorithms of Sect. 5.4 are used for classification to obtain the accuracy and precision. The experimental results are shown in Tables 8, 9, 10 and 11, and Fig. 7 plots accuracy against the number of iterations.

Table 8 Accuracy of CNN + Bagging
Table 9 Accuracy of CNN + random forest
Table 10 Accuracy of CNN + Boosting
Table 11 Accuracy of CNN + XGBoost
Fig. 7 Relationship between accuracy and the number of iterations

Combining Tables 8 and 9, it can be found that the accuracy of the CNN + Bagging framework is above 98%, a significant improvement over the plain Bagging framework, with Naive Bayes showing the most obvious gain. According to the F1 score, the SVM-based Bagging framework gives the best classification performance.

Combining Tables 10 and 11, it can be seen that the accuracy of the CNN + Boosting framework is above 97%, also a conspicuous improvement over the plain Boosting framework. According to the F1 score, the SVM-based variant performs best, with XGBoost close behind.

In Fig. 7, a value is recorded every 50 iterations. It can be seen that the accuracy of the CNN + ensemble learning framework is higher than that of the ensemble learning framework alone, and after CNN feature extraction the gap between the different ensemble methods remains small as the number of iterations increases. However, the accuracy of Perceptron-based CNN + Boosting is still lower than that of the other algorithms, consistent with the results of the plain Boosting framework.

In summary, classifying features extracted by a convolutional neural network with the ensemble learning framework yields higher accuracy than using the ensemble learning framework alone. When the convolutional neural network is used, the results of the different ensemble learning frameworks differ only slightly.

6 Conclusion

Because ensemble learning can achieve better accuracy than a single traditional classifier and a CNN can extract more comprehensive and deeper features, this paper proposes an ensemble learning framework for convolutional neural networks based on multiple classifiers. First, based on the Bagging and Boosting algorithms, the three data sets Iris, Wine quality-red and MNIST are classified using ensembles built from different single classifiers. The results show that, compared with a single classifier, this method improves accuracy. Although the accuracy of the Bagging framework is generally higher than that of the Boosting framework, the Boosting framework converges faster. Then the CNN is used to extract features from the MNIST data set, and the above ensemble learning frameworks are used to classify the extracted features. The results show that the classification performance is better than that of the ensemble learning framework alone, and the model converges faster when CNN features are used. In addition, random forest and XGBoost show better robustness. In future work, we will focus on validating the framework on larger data sets.