
1 Introduction

Pattern recognition has been successfully applied in many areas, ranging from medical and financial to safety-critical applications. Each problem requires an individual approach, and there is no simple answer as to which classifier or algorithm should be applied, because each method has its pros and cons. Nowadays, owing to the rapid development of artificial intelligence, many classification methods are at hand; there are statistical, distance-based and neural methods, to name only a few [4, 6, 14]. Since classification problems are complex and multidimensional, using a single base classifier is often not sufficient, so advanced methods for building multiclassifier (MC) systems have been developed; still, the proper choice of base classifiers is critical, because accurate and diverse ones are needed [19].

In the literature one can find many methods for constructing robust MC models, of which the most commonly used are bagging, boosting and the random subspace method [5, 12, 13]. In short, bagging applies sampling with replacement to obtain an independent training dataset for each individual classifier. Boosting modifies the input data distribution processed by each classifier in a sequence based on the results of the previously trained classifiers, paying more attention to difficult samples. The random subspace method trains each classifier on a randomly selected subset of the features. All the aforementioned algorithms focus on using weak classifiers to build powerful MC systems.

In this paper we introduce an algorithm called the Bayes metaclassifier (BMC) as a method for improving a weak classifier in terms of its classification performance. In general, BMC constitutes a probabilistic generalization of any base classifier, independent of its design paradigm, and has the form of the Bayes scheme. Since BMC provides a probabilistic interpretation of the base classifier's correct and incorrect classification, this method can be used in sequential classification or as a fusing mechanism in MC systems [10, 11]. The main goal of this study is to investigate how BMC can improve base classifier performance. For this purpose two experiments were performed, the first on synthetic data and the second on 22 real-life benchmark datasets.

The paper is divided into four sections and organized as follows. In Sect. 2 the pattern recognition task is introduced and the concept of the probabilistic Bayes metaclassifier is described. The results of the computer experiments are presented in Sect. 3, and Sect. 4 concludes the paper together with propositions of future plans and improvements.

2 Problem Statement

2.1 Pattern Recognition Task

This paper deals with the pattern recognition task in which we assume that a pattern belongs to class \(j \in \mathcal {M}\), where \(\mathcal {M}\) is an M-element set of possible states numbered with successive natural numbers (\(\mathcal {M} = \{1, 2, \ldots , M\}\)). The label j is unknown and does not undergo direct observation; we can only observe the features through which an object manifests itself. We denote a d-dimensional measured feature vector by \(x \in \mathcal {X}\) (thus \(\mathcal {X}\) is the feature space). In order to classify unknown patterns, as is usual in practice, we assume that we have at our disposal a so-called training set, which in the investigated decision task consists of N training patterns:

$$\begin{aligned} \mathcal {S} = \{(x_1, j_1), \ldots , (x_N, j_N)\}, \end{aligned}$$
(1)

where \(x_{k}, j_{k}\) denote the d-dimensional pattern and its true class label, respectively. In general, the decision algorithm with learning should use both the observed data, i.e. the feature vector x, and the knowledge included in the training set \(\mathcal {S}\). In consequence, the general classification algorithm with learning has the following form:

$$\begin{aligned} i = \psi (\mathcal {S}, x), i \in \mathcal {M}. \end{aligned}$$
(2)

\(\psi \) is a canonical model of a base classifier, which means that for a given \(x \in \mathcal {X}\) it produces both a class label and a vector of class supports [8]. A single value \(d_j(x)\) indicates the support of \(\psi \) for the hypothesis that the object x belongs to class j:

$$\begin{aligned} d(x) = [d_1(x), d_2(x), \ldots , d_M(x)], \quad \sum _{j=1}^M d_j(x) = 1. \end{aligned}$$
(3)
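To make the canonical model concrete, the sketch below shows a toy base classifier in Python that returns both a crisp label (2) and a support vector satisfying (3). The nearest-mean decision rule and the softmax mapping from distances to supports are our illustrative assumptions, not part of the formal model.

```python
import numpy as np

class CanonicalClassifier:
    """Toy base classifier psi: for a given x it returns a crisp class
    label (Eq. 2) and a support vector d(x) whose entries sum to one (Eq. 3)."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # Nearest-mean prototype per class (illustrative choice)
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def supports(self, x):
        # Map distances to class prototypes into supports via a softmax
        # (an assumption; any normalized support profile would do)
        d = np.exp(-np.linalg.norm(self.means_ - x, axis=1))
        return d / d.sum()

    def predict(self, x):
        d = self.supports(x)
        return self.classes_[np.argmax(d)], d
```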

2.2 BMC Algorithm Construction

In this section the construction of the BMC algorithm over a single base classifier is presented. First, the probabilistic model of classification must be introduced, in which the feature vector \(x \in \mathcal {X}\) and the class label \(j \in \mathcal {M}\) are observed values of the pair of random variables \(\mathbf X \) and \(\mathbf J \), respectively. The probability distribution of \((\mathbf X , \mathbf J )\) is determined by the a priori class probabilities \(p_j=P(\mathbf J =j)\) and the class-conditional density functions \(f(x|j)=f_j(x)\). The Bayes metaclassifier (BMC) \(\psi ^{BMC}\), originally introduced in [11], constitutes a probabilistic generalization of the base classifier (2) which has the form of the Bayes scheme built over the classifier \(\psi \). This means that \(\psi ^{BMC}\) takes its decision according to the maximum a posteriori probability rule [10]:

$$\begin{aligned} \psi ^{BMC}(x)=i \longleftrightarrow p(i|\psi (x)=k)=\max _{l \in \mathcal {M}} p(l|\psi (x)=k). \end{aligned}$$
(4)

The a posteriori probabilities \(p(i|k) \equiv P(\mathbf J =i|\psi (x)=k), i \in \mathcal {M},\) are given by the Bayes rule:

$$\begin{aligned} P(\mathbf J =i|\psi =k) = \frac{p_i \; p(k|i)}{\sum _j p_j \; p(k|j) }, \end{aligned}$$
(5)

where the probability \(p(k|i) \equiv P(\psi (x)=k|i)\) denotes the class-dependent probability of erroneous (if \(k \ne i\)) or correct (if \(k = i\)) classification of an object x by the base classifier \(\psi \). By placing the base classifier \(\psi \) in the probabilistic frame defined by the BMC (\(\psi ^{BMC}\)), we obtain a common probabilistic interpretation of the responses of base classifiers, regardless of their design paradigms. The key element in the BMC scheme described by Eqs. (4) and (5) is the calculation of the probabilities \(P(\psi (x)=k|i)\) at a point x, i.e. the class-dependent probabilities of correct classification and misclassification of the base classifier \(\psi \). For any deterministic base classifier these probabilities are equal to 0 or 1. In this paper, the proposed method of evaluating these probabilities is based on the concept of a hypothetical classifier called the Randomized Reference Classifier (RRC) [16, 17]. The RRC \(\psi ^{RRC}(x)\) is a stochastic classifier that classifies an object x according to the maximum rule applied to a vector of class supports \([\gamma _1(x), \gamma _2(x), \ldots , \gamma _M(x)]\), which are observed values of random variables (rvs) \([\varDelta _1(x), \varDelta _2(x), \ldots , \varDelta _M(x)]\). The probability distribution of the rvs is chosen in such a way that the RRC acts, on average, as the modeled base classifier, which requires that the following constraints are fulfilled:

$$\begin{aligned} \begin{aligned} \varDelta _j(x) \in \langle 0, 1\rangle , \quad \sum _{j=1}^M\varDelta _j(x) = 1, \\ E[\varDelta _j(x)] = d_j(x), \quad j \in \mathcal {M}, \end{aligned} \end{aligned}$$
(6)

where E is the expected value operator. Since the RRC performs classification in a stochastic manner, it is possible to calculate the class-dependent probabilities of correct classification \(P_c(j|x)\) and misclassification \(P_e(j|x)\), and furthermore to consider them equivalent to those of the modeled base classifier:

$$\begin{aligned} P(\psi (x)=k|i) \approx P(\psi ^{RRC}(x)=k|i). \end{aligned}$$
(7)
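The right-hand side of (7) lends itself to a simple Monte Carlo estimate. The sketch below assumes, purely for illustration, that the rvs \(\varDelta _j(x)\) follow a Dirichlet distribution with mean d(x), which satisfies the constraints (6); the RRC of [16, 17] uses beta-distributed supports, so this is a stand-in, not the original construction.

```python
import numpy as np

def rrc_probabilities(d, n_samples=10_000, concentration=10.0, seed=None):
    """Monte Carlo estimate of P(psi_RRC(x) = k), k = 1..M.

    d: support vector d(x) of the modeled base classifier (Eq. 3),
       assumed strictly positive here.
    Delta(x) ~ Dirichlet(concentration * d) gives E[Delta_j(x)] = d_j(x),
    so the constraints (6) hold; the RRC then applies the maximum rule.
    """
    rng = np.random.default_rng(seed)
    deltas = rng.dirichlet(concentration * np.asarray(d), size=n_samples)
    winners = deltas.argmax(axis=1)  # class chosen by the RRC in each draw
    return np.bincount(winners, minlength=len(d)) / n_samples
```

Evaluated at a validation point \(x_k\) with true label \(j_k\), the \(j_k\)-th entry of this vector plays the role of \(P_c(j_k|x_k)\), and the remaining entries the role of the misclassification probabilities \(P_e(\cdot|x_k)\).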

It should be noted that (7) is a heuristic approach to estimating the base classifier's probabilities of correct classification (misclassification). For the BMC training phase it is assumed that a validation set containing pairs of feature vectors and their corresponding class labels is available [15]. This set can be obtained by random ratio subsampling from the training set or, alternatively, by stacked generalization [18]:

$$\begin{aligned} \mathcal {V}=\{(x_1,j_1), (x_2,j_2), \ldots ,(x_N,j_N)\}; \ \;x_k \in \mathcal {X},\ j_k \in \mathcal {M}. \end{aligned}$$
(8)

\(\mathcal {V}\) is used to calculate \(P(\psi (x)=k|i) \approx P(\psi ^{RRC}(x)=k|i)\), i.e. the probability that the base classifier assigns an object x to class k, given that x belongs to class i. These values are known only at the discrete points from \(\mathcal {V}\), so to enable their dynamic calculation for any new object x from the feature space we need a neighborhood function describing how the probabilities at the validation points affect the new x. In this study, the Gaussian potential function was used:

$$\begin{aligned} \begin{aligned} P_c^{RRC}(j|x) = \frac{\sum _{x_k \in \mathcal {V}, j_k=j} P_c(j|x_k)\cdot \exp (-\alpha \cdot ||x - x_k||^2)}{\sum _{x_k \in \mathcal {V}, j_k=j} \exp (-\alpha \cdot ||x - x_k||^2)}, \\ \\ P_e^{RRC}(j|x) = \frac{\sum _{x_k \in \mathcal {V}, j_k \ne j} P_e(j|x_k)\cdot \exp (-\alpha \cdot ||x - x_k||^2)}{\sum _{x_k \in \mathcal {V}, j_k \ne j} \exp (-\alpha \cdot ||x - x_k||^2)}, \end{aligned} \end{aligned}$$
(9)

where the \(\alpha \) value in Eq. (9) is a scaling factor that should be adjusted individually for each classification problem. In this article, instead of manual selection, \(\alpha \) is selected during a preparation step of BMC using the training and validation sets. In this case, for each object from \(\mathcal {V}\) we calculate \(P_e(j|x)\) and \(P_c(j|x)\); then we randomly select \(40\%\) of the objects from the training set with stratification and use this new set to check for which \(\alpha \) BMC obtains the highest classification rate. For all experiments, \(\alpha \) was chosen from the set \(\{1, 2, 4, 6, 8\}\).
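A compact sketch of the remaining machinery is given below: the Gaussian potential smoothing of Eq. (9) and the Bayes decision of Eqs. (4)-(5). The pointwise probabilities `P_point` are assumed to have been precomputed at the validation points, e.g. with the `rrc_probabilities` sketch above; how the caller distributes \(p(k|i)\) for \(k \ne i\) among the wrong classes is a simplification left outside this fragment.

```python
import numpy as np

def gaussian_potential(x, V_X, V_y, P_point, j, alpha, correct=True):
    """Eq. (9): distance-weighted average of pointwise probabilities.

    V_X, V_y : validation feature vectors and labels (Eq. 8).
    P_point  : array holding P_c(j|x_k) (or P_e(j|x_k)) per validation point.
    """
    mask = (V_y == j) if correct else (V_y != j)
    w = np.exp(-alpha * np.sum((V_X[mask] - x) ** 2, axis=1))
    return np.sum(P_point[mask] * w) / np.sum(w)

def bmc_decision(priors, p_k_given_i):
    """Eqs. (4)-(5): posterior P(J=i | psi(x)=k) from the class priors p_i
    and the class-dependent probabilities p(k|i); the decision is the argmax."""
    joint = np.asarray(priors) * np.asarray(p_k_given_i)
    posterior = joint / joint.sum()
    return posterior.argmax(), posterior
```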

3 Experiments

The goal of this study was twofold. First, we wanted to check how BMC works on synthetically generated data coming from 1D normal distributions described by (\(\mu , \sigma \)). This is a very simplified setting, but it allows us to calculate the optimal Bayes decision boundary and the Bayes probability of misclassification; in consequence, we can relate the empirical probability of error of BMC to this probability. In the second experiment, the BMC model was applied to 22 benchmark datasets taken from the UCI Machine Learning Repository [3], the Ludmila Kuncheva Repository [9] and Keel [1], for 8 classifiers with different decision paradigms. A brief description of the datasets used is given in Table 1.

3.1 Experimental Setup

In the first experiment, three synthetic two-class datasets were generated from normal distributions \(N(\mu , \sigma )\) with different (\(\mu , \sigma \)) values (see Fig. 1). For each generated set it was confirmed (for \(p\ll 0.01\)) that each sample is not statistically significantly different from a normal distribution, using the Kolmogorov-Smirnov, Lilliefors and Shapiro-Wilk statistical tests. For the simulations, 50k and 10k objects were generated per class for the testing and validation sets, respectively, and the a priori class probabilities were equal. Each simulation was repeated 10 times and the results were averaged. Since we deal with scalar values, we use a linear classifier (\(\psi _l\)) whose decision boundary location is determined by a point \(x=p\), so no training procedure is needed. Furthermore, using \(\mu _{1, 2}, \sigma _{1, 2}\) of each model, the optimal Bayes decision boundary can be calculated and set as \(p_{optimal}\).

Fig. 1. Pdf functions for the three synthetic datasets used in the first experiment. From the left: (1) \(\mu _1=4, \sigma _1=1.5, \mu _2=8, \sigma _2=2.0\); (2) \(\mu _1=4, \sigma _1=2.3, \mu _2=9, \sigma _2=2.0\); (3) \(\mu _1=4, \sigma _1=2.3, \mu _2=11, \sigma _2=1.8\)
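For reference, the optimal boundary and the Bayes error for two 1D Gaussian classes with equal priors can be computed analytically by solving \(p_1 f_1(x) = p_2 f_2(x)\), which is quadratic in x; a sketch:

```python
import numpy as np
from scipy.stats import norm

def bayes_boundary(mu1, s1, mu2, s2, p1=0.5, p2=0.5):
    """Solve p1*f1(x) = p2*f2(x) for two Gaussians (quadratic in x);
    return the root lying between the class means."""
    a = 1 / (2 * s2**2) - 1 / (2 * s1**2)
    b = mu1 / s1**2 - mu2 / s2**2
    c = (mu2**2 / (2 * s2**2) - mu1**2 / (2 * s1**2)
         + np.log((p1 * s2) / (p2 * s1)))
    roots = np.roots([a, b, c]) if abs(a) > 1e-12 else np.array([-c / b])
    return roots[(roots > min(mu1, mu2)) & (roots < max(mu1, mu2))][0]

def bayes_error(mu1, s1, mu2, s2, p):
    """Bayes probability of error for a boundary at x = p, equal priors,
    class 1 assigned below p (assumes mu1 < mu2)."""
    return 0.5 * (norm.sf(p, mu1, s1) + norm.cdf(p, mu2, s2))
```

For the first dataset in Fig. 1, for example, this yields \(p_{optimal} \approx 5.93\), between the two class means.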

For the second experiment the following base classifiers were used:

  • qdc - quadratic classifier assuming normal class-conditional distributions (a separate covariance matrix is estimated for each class),

  • nmc - nearest mean classifier,

  • 6-nn - 6 nearest neighbors,

  • parzen - Parzen classifier,

  • dtc - Decision tree with information gain splitting criterion,

  • bpxn-1 - Feed-forward back-propagation neural network with 1 hidden layer and the number of learning epochs set to 120,

  • bpxn-2 - Feed-forward back-propagation neural network with 2 hidden layers and the number of learning epochs set to 120,

  • svm - support vector machine with a 3rd-order polynomial kernel.

For each dataset in the second experiment, feature vectors were normalized to zero mean and unit standard deviation. Simulations were conducted according to the \(5\,\times \,2\) cross-validation method for extracting training and testing sets from each dataset [2]. For comparison purposes the following approach was applied: as a reference, each base classifier was trained using the whole training set, while for BMC preparation \(30\%\) of the objects were randomly selected from the available training set to form the validation set, and the remaining objects were used for fitting the base classifier. All experiments were conducted in MATLAB using the PRTools 5.0 toolbox.
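Since PRTools is MATLAB-based, the following Python re-sketch of the protocol (normalization, \(5\,\times \,2\) CV, 30% stratified validation split) is an approximation for readers without MATLAB; `make_clf` and the scikit-learn helpers are our assumptions, not the toolbox used in the paper.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.preprocessing import StandardScaler

def five_by_two_cv(make_clf, X, y, val_fraction=0.30, seed=0):
    """5x2 CV protocol of the experiments: 5 repetitions of 2-fold CV;
    within each training half, 30% is split off (stratified) as the BMC
    validation set V and the rest fits the base classifier.
    make_clf() is assumed to return an object with fit/score methods."""
    scores = []
    for rep in range(5):
        skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=seed + rep)
        for train_idx, test_idx in skf.split(X, y):
            scaler = StandardScaler().fit(X[train_idx])  # zero mean, unit std
            X_tr, X_te = scaler.transform(X[train_idx]), scaler.transform(X[test_idx])
            y_tr, y_te = y[train_idx], y[test_idx]
            # Stratified 70/30 split of the training half; (X_val, y_val)
            # would feed the BMC preparation step described above
            X_fit, X_val, y_fit, y_val = train_test_split(
                X_tr, y_tr, test_size=val_fraction, stratify=y_tr,
                random_state=seed)
            clf = make_clf().fit(X_fit, y_fit)
            scores.append(clf.score(X_te, y_te))
    return np.array(scores)
```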

Table 1. Datasets used in the experiments
Table 2. Classification error rates for the first experiment with synthetic data. The numbers 1, 2, 3 denote the generated datasets. The notation used in the first column is as follows: p - the dividing (boundary) point; C - the analytically calculated Bayes probability of error for the decision boundary at point \(x=p\); \(\psi _l\) - the average classification error rate of the linear classifier with the decision boundary at point \(x=p\); BMC - the average classification error rate of \(\psi _l\) with the BMC scheme applied.

3.2 Experiment Results

Results for the first simulation scenario are presented in Table 2, where 7 classification error rates are shown for \(\psi _l\) with the decision boundary set to \(x=p_{optimal}+\delta \) for \(\delta \in \{-3, -2, -1, 0, 1, 2, 3\}\). Analyzing the results in Table 2, it can be seen that the upper bound for the BMC improvement is the Bayes error calculated for the optimal decision boundary. A greater correction is obtained for the classifier \(\psi _l\) when its decision boundary lies farther from the optimal one. One should keep in mind that this experiment is very simple, but it demonstrates that BMC works according to the Bayes scheme.

Table 3 shows the results for the base classifiers versus their BMC overlays on the benchmark datasets. Classification accuracies were averaged over 5 repetitions of two-fold cross-validation. We performed pairwise and multiple-comparison statistical tests. For the former, the \(5\,\times \,2\) CV F-test was applied to indicate which algorithm is better; in Table 3, \(+/=/-\) means that the BMC system is statistically significantly better, not statistically significantly different, or statistically significantly worse, respectively. For assessing the ranks of the proposed methods over all examined benchmarks, we used the Friedman ranking test, followed by the Shaffer post-hoc test to check which of the tested methods are distinctive among the n×n comparisons [7]. The level of \(p<0.05\) was considered statistically significant. According to the Friedman ranking test, which checks the overall performance on all available datasets, the BMC proposal was statistically significantly better than the base classifier in 5 out of 8 cases. A major improvement can be observed for problems 1, 5, 8, 9, 13, 14, 15, 17 and 18. This indicates that BMC works better with balanced datasets, while for imbalanced problems the use of a priori probabilities in the BMC decision scheme causes it to prefer the majority class. For imbalanced problems such as 12, 21 and 22, BMC does not improve classifier accuracy. This can be explained by the fact that random sampling from the training set is used to create the validation set; in this case, \(\mathcal {V}\) contains mainly examples from the majority class. Another important conclusion from this experiment is illustrated in Fig. 2, which shows a box plot for the exemplary laryngeal dataset for a selected base classifier versus the BMC algorithm: apart from improving classification performance, BMC also decreases the variance across subsequent cross-validation sets.
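For reproducibility, the pairwise test can be implemented directly; below is a sketch of the combined \(5\,\times \,2\) CV F-test in Alpaydin's form, which we assume corresponds to the test applied here.

```python
import numpy as np
from scipy.stats import f as f_dist

def cv52_f_test(diff):
    """Combined 5x2 CV F-test. diff is a 5x2 array of accuracy differences
    p_i^(j) between the two compared classifiers, one entry per fold j of
    replication i. Returns (F statistic, p-value)."""
    diff = np.asarray(diff, dtype=float)
    p_bar = diff.mean(axis=1, keepdims=True)   # mean difference per replication
    s2 = ((diff - p_bar) ** 2).sum(axis=1)     # per-replication variance estimate
    F = (diff ** 2).sum() / (2.0 * s2.sum())
    return F, f_dist.sf(F, 10, 5)              # F ~ F(10, 5) under H0
```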

Table 3. Classification accuracies (in percent) for the second experiment using different base classifiers. For each comparison, the first column presents results for the base classifier trained on the whole training set, and the second contains accuracies for the BMC algorithm. \(+/=/-\) means that the BMC system is statistically significantly better, not statistically significantly different, or statistically significantly worse (\(p=0.05\)). F stands for the Friedman ranking test checking overall performance on all databases.
Fig. 2. Box plot for the laryngeal dataset for a selected base classifier versus the BMC algorithm.

4 Conclusions

In this paper the detailed construction of the BMC algorithm was presented and validated. The first experiment confirmed that the upper bound of the BMC improvement over a base classifier is the Bayes error. This is not very surprising, since BMC works according to the Bayes scheme, maximizing the a posteriori probabilities for the decisions of a base classifier. From the analysis of the three synthetic datasets it is clear that BMC performance directly depends on the classification quality of the base classifier \(\psi \). The further experiment on benchmark datasets showed that BMC can significantly correct base classifier decisions, especially in cases where the base classifier is far from optimal. The overall performance measured by the Friedman ranking test indicates that the BMC proposal was statistically significantly better for 5 out of the 8 different classifiers. Additionally, when analyzing the results from each cross-validation phase, it is clearly visible that BMC decreases the classification variance compared to its base counterpart. On the other hand, the main drawback of BMC lies in proper validation set generation. BMC works better with balanced datasets, while for imbalanced problems the use of a priori probabilities in the BMC design causes its decision boundary to move towards the majority class. To sum up, since BMC provides a probabilistic interpretation of a base classifier's correct and incorrect responses, this method can be directly used in sequential classification problems or in MC systems, especially during classifier fusion. As future work, we would like to apply BMC to ensemble construction at both the crisp and continuous levels of classifier response.