1 Introduction

A classifier is a system that uses known knowledge to assign an unknown object to one class or category. The most common way of building this mapping function is to learn from previously classified instances with a specific learning algorithm. Although there are many classification algorithms, such as Decision Tree [4], Naïve Bayes (NB) [31], Artificial Neural Networks (ANN) [26], K-Nearest Neighbors (KNN) [13], Support Vector Machine (SVM) [33], Class Association Rules (CARS) [15] and others [12, 19, 25], no single method is superior to all the others. Therefore, the strategy of integrating different classification algorithms has attracted the interest of more and more researchers [23, 29, 34, 35].

The main idea of a multi-classifier ensemble algorithm is to combine complementary predictions so as to obtain better performance than the base classifiers [9]. To reach the final ensemble decision, it is therefore necessary to establish a process for combining the base decisions. There are two kinds of ensemble strategies for combining base classifiers: selection and fusion [20]. In classifier selection, each base classifier has a domain in which it is the most reliable classifier; the ensemble decision is made by selecting the decision of the base classifier whose domain covers the unknown object. In classifier fusion, the ensemble decision is established by combining several complementary base classifiers. Existing classifier fusion methods include the average method, the voting method, the weighting method, and others such as meta-learning methods [10, 30].

In ensemble methods based on classifier fusion, the main premise for obtaining the best ensemble performance is that the error rate of the base classifiers must be very low, and that an error made by one classifier must be compensated by the correct predictions of the other base classifiers. That is, the base classifiers must be as accurate and complementary as possible [14]. However, the more accurate the base classifiers are, the more similar they tend to be; and the better their complementarity, the more uncertain their individual predictions. The main challenge for classifier ensembles is therefore to strike a balance between accuracy and complementarity.

Faced with this challenge, a multi-classifier ensemble algorithm based on probabilistic classifiers and a distinguish fusion strategy is proposed in this paper. Firstly, considering that classifiers built on different principles are more likely to complement each other and that classifier training should remain as simple as possible, Support Vector Machine (SVM), Echo State Network (ESN), k-Nearest Neighbor (KNN), and Extreme Learning Machine (ELM) are selected as the base classifiers. Posterior probability estimation methods based on probability theory are proposed to turn them into four probabilistic classifiers that output the probability of each category. Then, the prediction of each base probabilistic classifier is regarded as a piece of evidence, and together they form the body of evidence. In order to avoid recognition errors caused by highly conflicting evidence, a distinguish fusion strategy based on the conflict coefficient is proposed: a body of evidence containing conflicting evidence is preprocessed first and then fused with Dempster's rule; otherwise, Dempster's rule is applied directly. Finally, the recognition results of the proposed multi-classifier ensemble algorithm are obtained by converting the fusion results into labels. In the experimental analysis, eight public datasets are used to verify the proposed ensemble algorithm, and the results show that the proposed algorithm achieves better classification performance in different scenarios than single models, the voting-based method, and the Multi-modal method [38].

The rest of the paper is organized as follows. Section 2 introduces some preliminaries related to this paper. Section 3 presents the detailed process of the proposed multi-classifier ensemble algorithm. Section 4 gives the analysis and discussion of the comparative experiments. Section 5 concludes the paper and outlines future work.

2 Preliminaries

2.1 Dempster/Shafer Evidence Theory

Evidence theory is an inexact reasoning theory first proposed by Dempster [6] in 1967 and further developed by Shafer [32] in 1976, also known as Dempster/Shafer evidence theory (D-S evidence theory). Its basic definition and combination rules are as follows:

Discernment Frame The discernment frame \(\varOmega \) is defined as a non-empty set, \(\varOmega = \{x_1,x_2,\ldots ,x_n\}\), which contains mutually exclusive events. The power set of discernment frame \(2^\varOmega \) contains \(2^n\) elements, which are represented as follows:

$$\begin{aligned} 2^\varOmega =\{ \emptyset ,\{x_1\},\{x_2\},\{x_3\},\ldots ,\{x_n\},\{x_1,x_2\},\ldots ,\{x_1,x_2,\ldots \},\ldots ,\varOmega \} \end{aligned}$$
(1)

Mass Function The mass function is also called the Basic Probability Assignment (BPA). On the discernment frame, it is defined to represent uncertain information. The mass function m is a mapping from the power set \(2^\varOmega \) to the interval [0, 1] that satisfies the following conditions:

$$\begin{aligned} \left\{ \begin{array}{rl} m(\emptyset ) = 0 \\ {\sum }_{x\in 2^\varOmega } m(x) =1 \end{array} \right. \end{aligned}$$
(2)

Any element x of the power set \(2^\varOmega \) with \(m(x)>0\) is called a focal element, and m(x) is its mass value.

Dempster’s Combination Rule In the framework of evidence theory, two independent mass functions \(m_1\) and \(m_2\) can be fused through the following Dempster’s combination rule:

$$\begin{aligned} m(x)=(m_1\oplus m_2)(x) = \frac{\sum _{B \cap C=x}m_1(B)m_2(C)}{1-K} \end{aligned}$$
(3)

Here, B and C are elements of \(2^\varOmega \). K is the normalization factor, defined as follows:

$$\begin{aligned} K = \sum _{B \cap C =\emptyset }m_1(B)m_2(C) \end{aligned}$$
(4)

or

$$\begin{aligned} K =1- \sum _{B \cap C \ne \emptyset }m_1(B)m_2(C) \end{aligned}$$
(5)

2.2 Single Classifier

For each ensemble algorithm, it is very important to choose suitable single models. The principle behind a classifier determines the perspective from which the trained model judges the category of a new instance. In other words, classifiers developed on different principles are more likely to be trained into complementary models. Based on this idea, SVM, ESN, KNN, and ELM are selected as the base classifiers of the ensemble algorithm; their training processes are simple, so the base classifiers can be obtained quickly. The basic principles of these four classifiers are introduced as follows.

Support Vector Machine Support Vector Machine (SVM), proposed by Vapnik and colleagues [5], is a kind of generalized linear classifier. It separates data in a supervised learning manner, and its decision boundary is the maximum-margin hyperplane. When there are more than two categories, SVM converts the problem into multiple binary classification problems. In target recognition and feature-based classification applications, SVM performs quite well by classifying categories with the maximum-margin hyperplane [7, 11, 22, 33]. Therefore, we choose it as a single model to verify the effectiveness of the ensemble algorithm, hoping to obtain distinctive evidence from the perspective of SVM.

Echo State Network Echo State Network (ESN), proposed by Jaeger [17, 18], is a recurrent neural network. Its hidden layer (reservoir) is generally sparsely connected, typically with only about 1% connectivity. The connectivity and weights of the hidden neurons are randomly assigned and kept constant. The main interest of ESN is that its behavior is non-linear, while the only weights modified during training are those connecting the hidden neurons to the output neurons. Owing to its simple network structure, it is widely used in regression analysis [3, 21, 37].

k-Nearest Neighbor k-Nearest Neighbor (KNN) [1], proposed by Thomas Cover, is a non-parametric method for classification and regression. KNN, which relies on distance for classification, is an algorithm based on instance learning (lazy learning). It is very simple to implement and its error is easy to control. It can handle a range of non-linear problems and is widely used in regression and classification applications [2, 8, 13, 39]. Therefore, KNN is also selected as a base classifier to generate characteristic evidence for the ensemble classification algorithm.

Extreme Learning Machine Extreme Learning Machine (ELM), proposed by Guang-Bin Huang [16], is a feed-forward neural network. In ELM, the weights of the hidden-layer neurons are randomly assigned and never updated, so ELM acts as a random non-linear projection followed by a linear output layer. Since ELM has good generalization performance, it has achieved good results in classification and regression applications [24, 36, 40].

3 Proposed Multi-classifier Ensemble Algorithm

Using the idea of classifier fusion, four classifiers, SVM, ESN, KNN, and ELM, are selected as base classifiers, and each of them outputs the probability of each category instead of a label. Building on previous research on the issue that highly conflicting evidence can easily lead to counterintuitive conclusions [42, 43], a novel multi-classifier ensemble algorithm is proposed based on D-S evidence theory. The schematic diagram of the algorithm is shown in Fig. 1, and its steps are described as follows (a code sketch of the overall pipeline is given after the list):

Fig. 1 Schematic diagram of the proposed multi-classifier ensemble algorithm

  1. Preprocessing data. The original data may contain null values and characters that the classifiers cannot handle directly, so data preprocessing is required. Real numbers are used as labels to replace characters and nulls, and all records are normalized to reduce the impact of value ranges on classification performance.

  2. Generating BPAs for each classifier. Unlike general classifiers, each classifier in the proposed ensemble algorithm outputs the probability of each category instead of the category label. The prediction of each classifier is regarded as a piece of evidence, and together they form a complete body of evidence. The key part of this module is therefore how to make each classifier output the probability of each category (Sect. 3.1).

  3. Fusion of evidence. Since D-S evidence theory may produce counter-intuitive results when dealing with highly conflicting evidence, we first judge whether there are conflicting evidence pairs in the body of evidence. If not, the evidence is fused directly with Dempster's combination rule. Otherwise, we generate weighted average evidence based on the credibility degree and information volume of each piece of evidence in the body of evidence, and then fuse the weighted average evidence with Dempster's combination rule (Sect. 3.2).
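The following sketch outlines the pipeline implied by these three steps; the probabilistic classifier wrappers and the fusion routine are placeholders for the methods detailed in Sects. 3.1 and 3.2, and the function names are ours, not the paper's.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def ensemble_predict(X_train, y_train, X_test, prob_classifiers, fuse):
    """Hypothetical outline of the ensemble: preprocess, collect BPAs, fuse, label.
    `prob_classifiers` are probabilistic SVM/ESN/KNN/ELM wrappers (Sect. 3.1),
    `fuse` is the distinguish fusion strategy (Sect. 3.2)."""
    scaler = MinMaxScaler()                       # step 1: normalize records
    X_train, X_test = scaler.fit_transform(X_train), scaler.transform(X_test)
    for clf in prob_classifiers:                  # step 2: train base classifiers
        clf.fit(X_train, y_train)
    labels = []
    for x in X_test:
        bpas = [clf.predict_proba(x.reshape(1, -1))[0] for clf in prob_classifiers]
        fused = fuse(bpas)                        # step 3: fuse the body of evidence
        labels.append(int(np.argmax(fused)))      # convert fused BPA to a label
    return np.array(labels)
```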

3.1 Generating BPAs for Each Classifier

According to the principle of each classifier, its basic probability assignment over the prediction categories is obtained as follows. For SVM, Platt [28] proposed estimating the posterior probability with a sigmoid function that maps the SVM output onto the interval [0, 1]. For KNN, the probability of each category follows directly from its classification principle, which takes the majority category among the k nearest neighbors as the classification result. For ESN and ELM, since these neural networks convert the classification problem into a regression problem and then map the regression results back to categories, we estimate the posterior probability of each category based on the distance between the forecast value and the target value.

Probability SVM Platt uses a sigmoid-fitting method to post-process the output of a standard SVM and convert it into a posterior probability, defined as follows:

$$\begin{aligned} P(y=1|f)=\frac{1}{1+e^{Af+B}} \end{aligned}$$
(6)

Here, f is the unthresholded output of the SVM for the input sample, and A and B are parameters estimated by maximum likelihood, i.e., by minimizing the negative log-likelihood:

$$\begin{aligned} \min _{A,B}\; -\sum _i \left[ t_i \log (P(f_i))+(1-t_i)\log (1-P(f_i)) \right] \end{aligned}$$
(7)

with

$$\begin{aligned} t_i=\left\{ \begin{array}{rl} \frac{N_+ + 1}{N_+ + 2}, &amp; y_i=+1 \\ \frac{1}{N_- + 2}, &amp; y_i=-1 \end{array} \right. \quad i=1,2,\ldots ,n \end{aligned}$$
(8)

where \(P(f_i)=P(y=1|f_i)\), \(N_+\) is the number of positive samples (\(y_i=+1\)) and \(N_-\) is the number of negative samples (\(y_i=-1\)).
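As an illustration of Eq. (6), the following sketch maps raw SVM decision values to posterior probabilities; the parameter values A and B are hypothetical, since in practice they are obtained by minimizing the objective of Eq. (7) (scikit-learn's SVC(probability=True) performs a comparable calibration internally).

```python
import numpy as np

def platt_posterior(f, A, B):
    """Eq. (6): map unthresholded SVM outputs f to P(y = 1 | f)."""
    return 1.0 / (1.0 + np.exp(A * f + B))

# Hypothetical sigmoid parameters; in practice A and B are fitted by
# minimizing the negative log-likelihood of Eq. (7) on held-out data.
A, B = -1.2, 0.05
f = np.array([-2.0, -0.3, 0.0, 1.5])   # example decision values
print(platt_posterior(f, A, B))         # larger f gives a probability closer to 1
```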

Probability KNN In the hard classification of KNN, the category of a new instance is judged by finding the k instances in the known data that are most similar to it; if most of these k instances belong to a certain category, the new instance is assigned to that category.

However, when the new instance lies on the boundary of two or more categories, any hard assignment is likely to be wrong. In this situation, it is more meaningful to output the probability of each category. For this purpose, the k known instances closest to the new instance are found, the number of them belonging to each category is counted, and this count is divided by k to obtain the probability that the new instance belongs to each category. It is defined as follows:

$$\begin{aligned} P_i=\frac{class_i}{k} \end{aligned}$$
(9)

Here, \(P_i\) represents the probability that the new instance may be category i, \(class_i\) indicates the number of instances of category i among the k nearest neighbors.
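A minimal sketch of Eq. (9); the neighbour labels below are made up and assumed to be already sorted by distance to the new instance (scikit-learn's KNeighborsClassifier.predict_proba with uniform weights produces the same fractions).

```python
import numpy as np
from collections import Counter

def knn_probabilities(neighbour_labels, k, classes):
    """Eq. (9): fraction of each class among the k nearest neighbours."""
    counts = Counter(neighbour_labels[:k])
    return np.array([counts.get(c, 0) / float(k) for c in classes])

# Hypothetical neighbour labels, already sorted by distance to the new instance
neighbour_labels = [0, 0, 1, 0, 2, 1, 1]
print(knn_probabilities(neighbour_labels, k=5, classes=[0, 1, 2]))  # [0.6 0.2 0.2]
```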

Probability ESN and Probability ELM When dealing with classification problems, neural networks such as ESN and ELM usually encode the categories with one-hot encoding, fit the training data as a regression problem, and then convert the regression result into a class label. Generally, when an artificial neural network needs to output category probabilities, a Softmax layer is added and the classification model is trained through gradient feedback. However, ELM is a feed-forward network and ESN simulates non-linear prediction through the linear fitting of random neurons; adding a Softmax layer with gradient feedback may greatly reduce the performance of these two classifiers. Therefore, a distance-based classification probability mapping function is proposed in this paper.

Here, we obtain an intuitive mapping function by analyzing the distance between the prediction result and the expected value. Figure 2a shows the statistics of the ELM forecast values on the Iris dataset. The forecast n\(\times \)3 matrix (n is the number of instances, 3 is the number of categories) is converted into a vector, the vector is sorted in ascending order, and finally the equation \({\bar{y}}=1-|1-y|\) (where y denotes each forecast value) converts each predicted value into a distance in (\(-\infty \), 1].

Fig. 2 a Distribution of absolute distance between the expected value and forecasting value; b schematic diagram of normal distribution

In Fig. 2a, the horizontal axis represents the index in the sorted result, and the vertical axis represents the converted distance. The closer the points in Fig. 2a are to the label value, the denser their distribution, which is very similar to the normal distribution in Fig. 2b. Therefore, we use a normal distribution function to estimate the probability that the corresponding category is true, i.e., the posterior probability of that category. Since the label True is 1 and the label False is 0, a mapping method based on the normal function (\(X\sim N(\mu ,\sigma ^2)\)) can be obtained by setting the parameters of the normal function. Here, we set \(\mu \) = 1 and \(\sigma \) = 1/3, and the mapping equation is as follows:

$$\begin{aligned} \hat{P_i}=\frac{3}{\sqrt{2\pi }}e^{-\frac{9(y_i-1)^2}{2}} \end{aligned}$$
(10)

Here, \(y_i\) represents the forecast value of the i-th category output by ESN or ELM. Normalizing \(\hat{P_i}\) gives \(P_i\), the estimated posterior probability of the i-th category:

$$\begin{aligned} P_i=\frac{\hat{P_i}}{\sum _i{\hat{P_i}}} \end{aligned}$$
(11)
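A small numerical sketch of Eqs. (10) and (11); the raw output values below are made up for illustration.

```python
import numpy as np

def normal_mapping(y):
    """Eqs. (10)-(11): map raw ESN/ELM regression outputs y (one value per
    category) to normalized posterior estimates using N(mu = 1, sigma = 1/3)."""
    p_hat = (3.0 / np.sqrt(2.0 * np.pi)) * np.exp(-9.0 * (y - 1.0) ** 2 / 2.0)
    return p_hat / p_hat.sum()

# Hypothetical raw outputs for a 3-class problem; the first category's output
# is closest to the label value 1, so it receives most of the probability mass.
y = np.array([0.93, 0.21, -0.10])
print(normal_mapping(y))   # approximately [0.94, 0.06, 0.00]
```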

3.2 Fusion of Evidence

After obtaining the BPAs, how to fuse these redundant and complementary pieces of evidence into a more valuable classification result is another difficulty addressed in this paper. Here, D-S evidence theory is used to fuse the outputs of the multiple classifiers. However, when there is highly conflicting evidence, direct fusion can make mistakes and draw counter-intuitive conclusions. Therefore, it is necessary to preprocess conflicting evidence before fusion. The degree of conflict within the body of evidence is measured by the normalization constant K. The preprocessed conflicting evidence, or the original evidence when there is no conflict, is then fused through Dempster's rule, and the fused category probabilities are used to decide the label of the new instance.

Fig. 3 Flow chart of fusion evidence


In order to give an intuitive understanding of the evidence fusion process, the flow chart in Fig. 3 is drawn. Firstly, the output \(BPA_i\) (i = 1, 2, 3, 4) of the probabilistic classifiers SVM, ESN, KNN, and ELM is used as a piece of evidence to form a mutually independent and complete body of evidence. Next, the algorithm judges the degree of conflict in the body of evidence according to the normalization constant K. Here, K serves as the acceptability criterion for highly conflicting evidence [27, 41]. When \(K < 0.95\), the conflict is within the acceptable range and the body of evidence can be fused directly through Dempster's rule. Otherwise, the conflicting evidence must be preprocessed before fusion. In this preprocessing, the weight of each piece of evidence is generated from two aspects: its credibility degree and its information volume. The credibility degree is obtained from the distance measure matrix of the evidence, and the information volume is a form of information entropy. According to these weights, weighted average evidence is generated as the preprocessing result of the conflicting evidence. Finally, the fusion result is converted into a category label and used as the forecast of the multi-classifier ensemble algorithm. The pseudo-code of the evidence fusion process is shown in Algorithm 1.
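Since Algorithm 1 itself is not reproduced here, the following is a hedged sketch of the fusion strategy of Fig. 3 for BPAs defined over singleton classes; the simple distance-based weighting below only approximates the paper's credibility-degree and information-volume weights, and the example BPAs are made up.

```python
import numpy as np

def conflict(m1, m2):
    """Normalization constant K (Eq. 4) for BPAs defined on singleton classes."""
    return 1.0 - np.dot(m1, m2)

def dempster(m1, m2):
    """Dempster's rule (Eq. 3) for BPAs defined on singleton classes."""
    prod = m1 * m2
    return prod / prod.sum()

def fuse_body_of_evidence(bpas, k_threshold=0.95):
    bpas = [np.asarray(m, dtype=float) for m in bpas]
    n = len(bpas)
    high_conflict = any(conflict(bpas[i], bpas[j]) >= k_threshold
                        for i in range(n) for j in range(i + 1, n))
    if high_conflict:
        # Preprocess: weight each piece of evidence by a simple distance-based
        # credibility (an approximation of the paper's credibility degree and
        # information volume) and build the weighted average evidence.
        dist = np.array([[np.linalg.norm(a - b) for b in bpas] for a in bpas])
        support = 1.0 / (dist.sum(axis=1) + 1e-12)
        weights = support / support.sum()
        avg = sum(w * m for w, m in zip(weights, bpas))
        bpas = [avg] * n          # fuse the averaged evidence n - 1 times
    fused = bpas[0]
    for m in bpas[1:]:
        fused = dempster(fused, m)
    return fused

# Four BPAs from the probabilistic classifiers; the third one conflicts strongly
# with the others (K >= 0.95), so the preprocessing branch is taken.
bpas = [[0.90, 0.08, 0.02], [0.85, 0.12, 0.03], [0.00, 0.02, 0.98], [0.88, 0.10, 0.02]]
print(fuse_body_of_evidence(bpas))
```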

4 Experiment and Analysis

In order to verify the multi-classifier ensemble algorithm proposed in this paper, comparative experiments have been completed on eight public data sets. The data sets, experimental setup, results, and analysis will all be described in this section.

4.1 Data Set and Experimental Settings

Data Set The attributes of the eight public data sets from the UCI database are as follows:

  1. Post is an abbreviated form of Postoperative Patient Data; its classification task is to determine where patients in the postoperative recovery area should be sent;

  2. Nursery is an abbreviated form of Nursery Database; it is derived from a hierarchical decision model originally developed to rank applications to nursery schools;

  3. Iris is an abbreviated form of Iris Data Set; it is the best-known database in the pattern recognition literature and contains attribute information for three iris plants;

  4. Wine is an abbreviated form of Wine Recognition Data; it is the result of a chemical analysis of wines grown in the same region of Italy but derived from three different varieties;

  5. Breast is an abbreviated form of Breast Tissue Data Set; it contains electrical impedance measurements of fresh tissue samples excised from the breast;

  6. Banance is an abbreviated form of Balance Scale Data Set; it is a database of balance weights and distances;

  7. Hayes is an abbreviated form of the Hayes-Roth & Hayes-Roth (1977) Database; it is a theme library from a human subjects study;

  8. Page is an abbreviated form of Page Blocks Classification Data Set; the problem consists of classifying all the blocks of a document's page layout detected by a segmentation process.

Table 1 Data set overview

Table 1 shows information about the eight public data sets used in the analysis. In Table 1, the first column is the name of the data set; the second column is the distribution of the number of records over the categories of the corresponding data set (the symbol “/” separates categories, and each number indicates the number of instances in that category); the third column is the number of feature dimensions; and the last column is the total number of instances. In the comparative experiments in this paper, 30% of the instances of each dataset are randomly selected as the test set, and the rest serve as the training set.

Experimental Settings In the comparative experiments, we set the same parameters for every dataset and run the programs in Python 2.7. In SVM, the stopping tolerance tol is \(1e-3\) and the penalty is set to l2. In ESN, the number of reservoir neurons is 200, the spectral radius of the recurrent weight matrix is 0.25, the proportion of recurrent weights is 0.95, and the output activation function is tanh. In KNN, the number of neighbors is 5 and the power parameter of the Minkowski metric is 2. In ELM, the number of hidden-layer nodes is 200 and the activation function is tanh. In addition, the three evaluation indicators, Accuracy (ACC), F1 score (F1), and Area Under the Receiver Operating Characteristic curve (AUROC), are computed with the “sklearn.metrics” interface.
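For reference, the SVM and KNN settings above correspond roughly to the scikit-learn estimators sketched below; ESN and ELM are not part of scikit-learn and are therefore omitted, and the exact mapping of the remaining options is our assumption rather than the paper's configuration.

```python
# Rough scikit-learn counterparts of the SVM and KNN settings described above.
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

svm = SVC(tol=1e-3, probability=True)            # stopping tolerance 1e-3; Platt-style posteriors
knn = KNeighborsClassifier(n_neighbors=5, p=2)   # 5 neighbours, Minkowski metric with p = 2
```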

4.2 Experimental Results and Discussion

On the eight public data sets, the training set and test set are randomly divided with a test-set ratio of 30%, and the classification accuracy of 100 repeated experiments is plotted in Fig. 4. Its subgraphs (a), (b), (c), (d), (e), (f), (g), and (h) correspond to the classification accuracy of 100 random repeated experiments on the Post, Nursery, Iris, Wine, Breast, Banance, Hayes, and Page data sets respectively. The horizontal axis marks the index of the repeated trials, and the vertical axis is the classification accuracy (ACC) of SVM, ESN, KNN, ELM, the Voting-based method, and the proposed ensemble algorithm, which are distinguished by the legend in the upper right corner of each subgraph.

Fig. 4 Repeated experiments on eight public datasets

Analyzing the results shown in Fig. 4, we find that the proposed ensemble algorithm achieves the best classification results in most cases, even when one base classifier almost fails, as in the scenarios shown in Fig. 4b, e. Compared with the Voting-based ensemble method, the proposed algorithm also achieves significantly better performance in multiple scenarios, as shown in subgraphs (a), (b), (e), (f), (g) and (h) of Fig. 4. Overall, this analysis supports the conclusion that the multi-classifier ensemble algorithm proposed in this paper delivers good classification accuracy across different applications.

To further explore why the proposed multi-classifier ensemble algorithm achieves better classification performance, three typical examples of conflicting evidence fusion on the Post data set are listed in Table 2. “No.1”, “No.2” and “No.3” mark three classification instances in Post. “BPAs” and “Target” correspond, respectively, to the basic probability assignment of each category (separated by the symbol “/”) and the recognized category label. The last line, “Real”, is the true label of the instance. As Table 2 shows, when a classifier generates BPAs that conflict with the other evidence, fusing the weighted average evidence often still yields the correct classification result. This observation underpins the multi-classifier fusion classification algorithm proposed in this paper.

To further analyze the performance of the proposed algorithm, the accuracy (ACC), F1 score (F1), and area under the receiver operating characteristic curve (AUROC) of 100 randomized trials are used to evaluate its classification performance. In Table 3, the evaluation values of the 100 trials on the eight public data sets are listed in the form “mean ± standard deviation”. In addition, the statistics of Table 3, together with the significance of the proposed algorithm relative to the other ensemble methods, are visualized in Fig. 5. The horizontal lines in each subplot show the significance (p-value) of the proposed algorithm relative to the other ensemble methods for the three statistics ACC, F1, and AUROC. A red horizontal line indicates a significant difference, the value on the line is the p-value, “*” indicates moderately significant, and “**” indicates very significant. Analyzing the statistics of the repeated experiments, we find that although the proposed algorithm has little advantage over the other methods in the accuracy and F1 score on the Wine data set and in the AUROC on the Page data set, it shows the best performance on most evaluation indexes of most data sets. In general, the proposed multi-classifier fusion approach based on D-S evidence theory achieves higher performance than all single classifiers, the Voting-based method, and the Multi-modal method. This further verifies the previous conclusion, i.e., that the proposed algorithm has better classification performance than the others.

Table 2 Conflict evidence fusion for Post
Table 3 Comparison of experimental results

Through the above analysis and discussion, reasonable conclusions can be drawn as follows:

  1. A single classifier is limited by its analysis perspective, which is determined by its basic principles, and often leads to uncertain and fuzzy decisions in real, complex classification tasks;

  2. Fusing conflicting evidence based on D-S evidence theory can largely avoid the bias of a single classifier and infer more credible decisions;

  3. The proposed multi-classifier ensemble algorithm performs better than all single classifiers, the Voting-based method, and the Multi-modal method on eight public data sets with different attributes, achieving better overall classification performance.

Fig. 5 The performance of different methods in different data sets

In summary, the accuracy curves of the 100 repeated experiments, the conflict-evidence fusion of the classification examples, and the statistics of the experimental data all indicate that the multi-classifier ensemble algorithm proposed in this paper achieves better classification performance than the other methods.

5 Conclusion and Outlook

In this paper, a multi-classifier ensemble algorithm based on D-S evidence theory is proposed. The main contributions are as follows: a mapping method from each base classifier to the posterior probability of its predictions is developed, yielding four probabilistic classifiers that provide multi-source complementary evidence; and, by measuring conflicting evidence with the conflict coefficient, the body of evidence is distinguished and fused to obtain the final classification result. In the comparative experiments, eight different public data sets are used to test the performance of the proposed algorithm in different applications in terms of Accuracy, F1, and AUROC. The experimental results confirm that the proposed multi-classifier ensemble algorithm outperforms all single classifiers, the Voting-based method, and the Multi-modal method across the different application scenarios.

In future research, we will further explore two aspects: first, how the number of classifiers affects the performance of the ensemble method; second, how to optimize the parameters of the base classifiers when neural networks with a back-propagation mechanism are used as base classifiers.