1 Introduction

The pattern recognition task is one of the main research directions in machine learning [1]. The classification task can be accomplished by a single classifier or by a team of classifiers. In the literature, the use of multiple classifiers for a decision problem is known as a multiple classifier system (MCS) or an ensemble of classifiers (EoC) [4, 11, 27]. These methods are popular for their ability to fuse multiple classification outputs for better classification accuracy. The building of MCSs consists of three phases: generation, selection, and integration [3]. The final decision, made in the third phase, uses the predictions of the selected classifiers. The output of an individual classifier can be divided into three types [17].

  • The abstract level—classifier \(\psi \) assigns the unique label j to a given input x.

  • The rank level—in this case for each input x, each classifier produces an integer rank array. Each element within this array corresponds to one of the defined class labels. The array is usually sorted with the label at the top being the first choice.

  • The measurement level—the output of a classifier is represented by a measurement value that expresses the degree of assigning the class label to the given input x. An example of such a representation of the output is the a posteriori probability returned by a Bayes classifier.

Following these three types of outputs of the base classifier, various problems concerning the combination function (third phase) of classifier outputs are considered. The problems studied in [19, 25] belong to the abstract level. The combining of outputs at the rank level is presented in [13], and the problems studied in [16, 18] belong to the last level. The selection of classifiers is one of the important problems in the creation of an EoC [15, 24]. This task is related to the choice of a set of classifiers from the available pool of classifiers. Formally, if we choose one classifier, it is a classifier selection process, but if we choose a subset of classifiers from the pool, then it is an ensemble selection. In this work, these terms will be used interchangeably. A distinction can be made between static and dynamic selection [22, 26]. In static classifier selection one set of classifiers is selected to create the EoC, and this EoC is used in the classification of all the objects from the testing set. The main problem in this case is to find a pertinent objective function for selecting the classifiers. One of the best objective functions for the abstract level of classifier outputs is the simple majority voting error [23]. In dynamic classifier selection a specific subset of classifiers is selected for each unknown sample [5]. This means that different EoCs are selected for different objects from the testing set. In this type of classifier selection, the classifier is chosen and assigned to the sample based on different features [28] or different decision regions [7, 14]. In this work we consider dynamic ensemble selection. In detail, we propose a new selection method based on the analysis of the discriminant functions in the context of binary classification. The paper presents two approaches: the first takes into account the normalization of the discrimination functions, while in the second normalization is not performed.
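To make the distinction concrete, dynamic selection is often illustrated by a local-accuracy scheme, in which, for each test object, the classifiers that perform best in its neighbourhood of validation samples are kept. The sketch below shows only this general idea; the function name, neighbourhood size, and data layout are assumptions for illustration, not the method proposed in this paper.

```python
import numpy as np

def dynamic_select(clf_preds_val, y_val, x_val, x, n_neighbors=5, top=3):
    """Illustrative dynamic selection via local accuracy (not this paper's method).

    clf_preds_val: array (K, n_val) with each base classifier's predictions
    on a validation set; y_val: true validation labels; x_val: validation
    feature matrix. For the test object x, classifiers are ranked by their
    accuracy on the n_neighbors validation samples closest to x, and the
    indices of the `top` best classifiers are returned.
    """
    d = np.linalg.norm(x_val - x, axis=1)       # distances from x to validation set
    nn = np.argsort(d)[:n_neighbors]            # local decision region of x
    local_acc = (clf_preds_val[:, nn] == y_val[nn]).mean(axis=1)
    return np.argsort(local_acc)[::-1][:top]    # best-first classifier indices
```

A different EoC may thus be returned for every test object, which is exactly what distinguishes dynamic from static selection.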
The text is organized as follows: Sect. 2 presents the ensemble of classifiers, the combination function based on the sum method, and the proposed method for dynamic ensemble selection based on the analysis of the discriminant functions. Section 3 includes the description of the research experiments comparing the suggested algorithms with the base classifiers. Finally, the discussion and conclusions from the experiments are presented.

2 Ensemble of Classifiers

Let us consider the binary classification task. It means that we have two class labels \(M=\{1, 2\}\). Each pattern is characterized by a feature vector X. The recognition algorithm maps the feature space X to the set of class labels M according to the general formula:

$$\begin{aligned} \varPsi :X \rightarrow M. \end{aligned}$$
(1)

Let us assume that K different classifiers \(\varPsi _1,\varPsi _2,\ldots ,\varPsi _K\) are available to solve the classification task. In MCSs these classifiers are called base classifiers. In the binary classification task K is assumed to be an odd number. As a result of running all the classifiers, K responses are obtained for each object. The output information from all K component classifiers is used to make the final decision of the MCS. This decision is made based on the predictions of all the base classifiers. One of the possible methods for integrating the outputs of the base classifiers is the sum rule. In this method the score of the MCS is based on the application of the following sums:

$$\begin{aligned} s_i(x)=\sum _{k=1}^{K}\hat{p}_k(i|x), \qquad i \in M, \end{aligned}$$
(2)

where \(\hat{p}_k(i|x)\) is an estimate of the discrimination functions for class label i returned by classifier k. The final decision of the MCSs is made following the maximum rule:

$$\begin{aligned} \varPsi _{S}(x)= \arg \max _{i} s_i(x). \end{aligned}$$
(3)

In the presented method (3) the discrimination functions obtained from the individual classifiers take an equal part in building MCSs. This is the simplest situation in which we do not need additional information about the testing process of the base classifiers except for the models of these classifiers. One of the possible methods in which weights of the base classifier are used is presented in [2].
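As a minimal sketch, the sum rule (2) and the maximum rule (3) can be implemented directly on a matrix of estimated discrimination functions; the array layout and names below are illustrative assumptions.

```python
import numpy as np

def sum_rule(p_hat):
    r"""Combine K base classifiers by the sum rule.

    p_hat: array of shape (K, 2) with p_hat[k, i-1] being the estimate
    \hat{p}_k(i|x) of the discrimination function for class label i
    returned by classifier k, for one object x.
    """
    s = np.asarray(p_hat).sum(axis=0)   # s_i(x), Eq. (2)
    return int(np.argmax(s)) + 1        # maximum rule, Eq. (3); labels in {1, 2}

# Three base classifiers scoring one object x:
print(sum_rule([[0.6, 0.4], [0.7, 0.3], [0.9, 0.1]]))  # prints 1
```

Because every classifier contributes its raw discrimination functions with equal weight, no information beyond the trained models themselves is required.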

2.1 Selection of Discrimination Functions

The classifier selection phase is often criticized due to its usually high computational cost [3]. Bearing this in mind, we propose the dynamic selection of discrimination functions, which has minor computing requirements. It uses the assumption that the more credible the classifier, the larger the difference between its discrimination functions. In the discussed binary classification task this means that during the selection we discard the classifiers for which \(|\hat{p}_k(1|x)-\hat{p}_k(2|x)|<\alpha \). The parameter \(\alpha \) determines the required size of the difference in the discriminant functions and takes values from the interval \(\alpha \in [0,1)\). For the value \(\alpha =0\) the selection process does not occur. The proposed selection process is performed before the calculation of the coefficients \(s_i(x)\). The final decision is made according to the sum rule, and the MCS with selection is labelled \(\varPsi _{S_{ON}}^{\alpha }\). In the MCS model labelled \(\varPsi _{S_{N}}^{\alpha }\), the normalization of the discrimination functions is additionally carried out. Normalization is performed for each class label i according to the rule:

$$\begin{aligned} \widehat{p'}_k(i|x)=\frac{\hat{p}_k(i|x)- \min (\hat{p}_1(i|x),\ldots ,\hat{p}_K(i|x))}{\max (\hat{p}_1(i|x),\ldots ,\hat{p}_K(i|x))-\min (\hat{p}_1(i|x),\ldots ,\hat{p}_K(i|x))}, \qquad k \in \{1,\ldots ,K\}. \end{aligned}$$
(4)

The aim of the experiments presented in the next section was, among other things, to compare the two proposed selection methods.

Table 1 Description of data sets selected for the experiments
Table 2 Classification accuracy and mean rank positions for the proposed selection algorithm with normalization produced by the Friedman test
Table 3 Classification accuracy and mean rank positions for the proposed selection algorithm without normalization produced by the Friedman test
Table 4 Classification accuracy and mean rank positions for the base classifiers (\(\varPsi _1,\ldots ,\varPsi _7\)) and MCSs algorithms produced by the Friedman test

3 Experimental Studies

A series of experiments were carried out to assess the quality of classification. The aim of the experiments was to compare the proposed selection algorithms with the base classifiers and with ensemble classifiers based on the majority voting and sum methods. In the experimental research 11 benchmark data sets were used. Two of them were generated randomly: the so-called Banana and Highleyman sets. The other nine benchmark data sets come from the UCI repository [10]. All the data sets concern the binary classification task. The numbers of attributes, examples, and the class ratios are presented in Table 1. The studies did not include the impact of the feature selection process on the quality of classification; therefore, the feature selection process [12, 21] was not performed. The results were obtained via the ten-fold cross-validation method. In the experiments, 7 base classifiers were used. Two of them work according to the \(k\)-NN rule with the parameter k equal to 5 or 7. Two base classifiers are decision tree algorithms with binary splits and a maximum tree depth of 6 levels; in the decision nodes the Gini index or entropy is used. One of the classifiers uses a combination of decision tree base models: the gradient boosting algorithm. The other two classifiers use Support Vector Machine models; one of them uses the Least Squares SVM method and the second the Decomposed Quadratic Programming method. The experiments were performed in the SAS Enterprise Miner environment. Table 2 shows the results of the classification for the proposed classifier selection with normalization of the a posteriori probability functions. Additionally, the mean ranks obtained by the Friedman test are presented. The values show that the best value of the parameter \(\alpha \) for this classifier selection method is 0.4.
Table 3 shows the results for the case where there is no normalization of the a posteriori probability functions. For this case the optimal value of the parameter \(\alpha \) is 0.1. The classifiers with the selected values of the parameter \(\alpha \) were compared with the base classifiers and with the ensemble methods based on the majority voting and sum methods. The results of classification with the mean ranks obtained by the Friedman test are presented in Table 4. To compare the results the post-hoc Nemenyi test was used [24]. The critical difference for this test at \(p=0.05\) is equal to \(CD=4.51\). Since the difference between \(\varPsi _{Sum}\) and the proposed algorithms \(\varPsi _{S_N}^{0.4}\) and \(\varPsi _{S_{ON}}^{0.1}\) is smaller than 4.51, we can conclude that the post-hoc test is not powerful enough to detect any significant differences between these algorithms. However, the proposed algorithm for the selection of the a posteriori probability functions achieved the best result, i.e., the lowest average rank (Table 3).
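For reference, the Nemenyi critical difference follows the standard formula \(CD = q_{\alpha }\sqrt{k(k+1)/(6N)}\), where k is the number of compared algorithms and N the number of data sets. The sketch below computes it; the \(q_{\alpha }\) constant in the example is quoted from standard studentized-range tables as an assumption, since the paper does not state the values it used.

```python
import math

def nemenyi_cd(q_alpha, k, n_datasets):
    """Nemenyi post-hoc critical difference: two algorithms differ
    significantly if their mean Friedman ranks differ by at least
    CD = q_alpha * sqrt(k * (k + 1) / (6 * N))."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))

# e.g. k = 10 compared algorithms over N = 11 data sets at p = 0.05,
# with q ≈ 3.164 (an assumed table value):
cd = nemenyi_cd(3.164, 10, 11)
```

Plugging in the k and N corresponding to Table 4 yields the threshold against which the rank differences above are compared.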

4 Conclusion

This paper discusses classifier selection for the binary classification task. The process proposed in the paper concerns the selection of the a posteriori probability functions. The paper presents two approaches: in one of them normalization of the a posteriori probability functions is carried out, while in the second case normalization is not performed. Distributed computing approaches enable the efficient and parallel processing of complicated data analysis tasks, also in the context of classification systems with multiple classifiers [20]. The methods of classifier selection presented in the paper show the ability to work in a parallel and distributed environment. Parallel processing provides the possibility to speed up the selection of the a posteriori probability functions, whose results are needed to make the decision of the classifier ensemble. Additionally, the proposed approach can be applied in various practical tasks involving multiple elementary classification tasks [6, 8, 9]. In the paper several experiments were carried out on data sets available from the UCI repository and on synthetic data sets. The aim of the experiments was to compare the proposed selection algorithms with the base classifiers and with ensemble classifiers based on the majority voting and sum methods. For the proposed selection method, we obtained an improvement in the classification quality as measured by the mean ranks from the Friedman test. However, the difference in average ranks is too small to detect statistically significant differences between the proposed selection method and the ensemble method based on the sum rule.