Abstract
In this paper a new algorithm called the Bayes metaclassifier (BMC) is introduced as a method for improving the performance of weak classifiers. In general, BMC constitutes a probabilistic generalization of any base classifier and has the form of the Bayes scheme. To validate BMC classification, two experiments were designed. In the first, three synthetic datasets were generated from normal distributions to empirically calculate and check the upper bound on the improvement of a base classifier when the BMC approach is applied. Furthermore, to validate the usefulness of this algorithm, extensive simulations on 22 available benchmarks were performed, comparing the BMC model against 8 base classifiers with different design paradigms.
1 Introduction
Pattern recognition has been successfully applied in many areas, from medical and financial to safety-critical applications. Each problem requires an individual approach, and there is no simple answer to which classifier or algorithm should be applied, because each method has its pros and cons. Nowadays, due to the huge development of artificial intelligence, many classification methods are at hand: statistical, distance-based and neural methods, to name only a few [4, 6, 14]. Since classification problems are very complex and multidimensional, using a single base classifier is often not sufficient, so advanced methods for building multiclassifier (MC) systems have been developed; still, the proper choice of base classifiers is critical, because accurate and diverse ones are needed [19]. In the literature one can find many methods for constructing robust MC models, but the most commonly used are bagging, boosting and the random subspace method [5, 12, 13]. In short, bagging applies sampling with replacement to obtain independent training datasets for each individual classifier. Boosting modifies the input data distribution processed by each classifier in a sequence based on the results of the classifiers trained before it, paying more attention to difficult samples. All the aforementioned algorithms focus on using weak classifiers to build powerful MC systems. In this paper we introduce an algorithm called the Bayes metaclassifier (BMC) as a method for improving a weak classifier in terms of its classification performance. In general, BMC constitutes a probabilistic generalization of any base classifier, independent of its design paradigm, and has the form of the Bayes scheme. Since BMC provides a probabilistic interpretation of the base classifier's correct and incorrect classifications, this method can be used in sequential classification or as a fusing mechanism in MC systems [10, 11]. The main goal of this study is to investigate how BMC can improve base classifier performance.
For this purpose two experiments were performed: the first on synthetic data and the second on 22 real-life benchmark datasets. The paper is divided into four sections and organized as follows. In Sect. 2 the pattern recognition task is introduced and the concept of the probabilistic Bayes metaclassifier is described. The results of computer experiments are presented in Sect. 3, and Sect. 4 concludes the paper together with propositions of future plans and improvements.
2 Problem Statement
2.1 Pattern Recognition Task
This paper deals with the pattern recognition task in which we assume that the pattern belongs to class \(j \in \mathcal {M}\), where \(\mathcal {M}\) is an M-element set of possible states numbered with successive natural numbers (\(j \in \mathcal {M} = \{1, 2, \ldots , M\}\)). The label j is unknown and does not undergo our direct observation; what we can observe are only the features by which an object manifests itself. We will denote a d-dimensional measured feature vector by \(x \in \mathcal {X}\) (thus \(\mathcal {X}\) is the feature vector space). In order to classify unknown patterns, as is usual in practice, we assume that we have at our disposal a so-called training set, which in the investigated decision task consists of N training patterns:

$$\mathcal {S} = \{(x_1, j_1), (x_2, j_2), \ldots , (x_N, j_N)\},$$
where \(x_{k}, j_{k}\) denote the d-dimensional pattern and its true classification, respectively. In general, the decision algorithm with learning should use both the currently observed data, i.e. the feature vector x, and the knowledge included in the training set \(\mathcal {S}\). In consequence, the general classification algorithm with learning is of the following form:

$$\psi (x, \mathcal {S}) = i, \quad i \in \mathcal {M}.$$
\(\psi \) is a canonical model of a base classifier, which means that for a given \(x \in \mathcal {X}\) it produces a class label and a vector of class supports [8]. A single \(d_j(x)\) indicates the support given by \(\psi \) to the hypothesis that the object x belongs to class j.
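As an illustration of this canonical model, the sketch below (our own, with illustrative names, not code from the paper) wraps a nearest-mean rule so that it returns both a crisp label and a support vector that is non-negative and sums to one:

```python
import numpy as np

class NearestMeanClassifier:
    """Toy base classifier exposing the canonical interface:
    a crisp decision plus a vector of class supports d_j(x)."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def supports(self, x):
        # Turn negative distances to class means into supports that
        # are non-negative and sum to one (softmax normalization).
        d = -np.linalg.norm(self.means_ - x, axis=1)
        e = np.exp(d - d.max())
        return e / e.sum()

    def predict(self, x):
        # Crisp label: class of maximum support.
        return self.classes_[np.argmax(self.supports(x))]
```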
2.2 BMC Algorithm Construction
In this section the construction of the BMC algorithm over a single classifier is presented. Firstly, the probabilistic model of classification must be introduced, in which the feature vector \(x \in \mathcal {X}\) and class label \(j \in \mathcal {M}\) are observed values of the pair of random variables \(\mathbf X \) and \(\mathbf J \), respectively. The probability distribution of \((\mathbf X , \mathbf J )\) is determined by the a priori class probabilities \(p_j=P(\mathbf J =j)\) and the class-conditional density functions \(f(x|j)=f_j(x)\). The Bayes metaclassifier (BMC) \(\psi ^{BMC}\), originally introduced in [11], constitutes a probabilistic generalization of the base classifier (2) which has the form of the Bayes scheme built over the classifier \(\psi \). This means that \(\psi ^{BMC}\) takes its decision according to the maximum a posteriori probability rule [10]:

$$\psi ^{BMC}(x) = i \quad \Leftrightarrow \quad p(i|\psi (x)) = \max _{k \in \mathcal {M}}\, p(k|\psi (x)).$$
The a posteriori probabilities \(p(i|k) \equiv P(\mathbf J =i|\psi (x)=k), i \in \mathcal {M}\), are given by the Bayes rule:

$$p(i|k) = \frac{p_i\, p(k|i)}{\sum _{j \in \mathcal {M}} p_j\, p(k|j)},$$
where the probability \(p(k|i) \equiv P(\psi (x)=k|i)\) denotes the class-dependent probability of erroneous (if \(k \ne i\)) or correct (if \(k = i\)) classification of an object x by the base classifier \(\psi \). Placing the base classifier \(\psi \) in the probabilistic frame defined by the BMC (\(\psi ^{BMC}\)), we get a common probabilistic interpretation of the responses of base classifiers, regardless of their design paradigms. The key element in the BMC scheme described by Eqs. (4) and (5) is the calculation of the probabilities \(P(\psi (x)=k|i)\) at the point x, i.e. the class-dependent probabilities of correct classification and misclassification for the base classifier \(\psi \). Normally, for any deterministic base classifier these probabilities are equal to 0 or 1. In this paper, the proposed method of evaluating these probabilities is based on the original concept of a hypothetical classifier called the Randomized Reference Classifier (RRC) [16, 17]. The RRC \(\psi ^{RRC}(x)\) is a stochastic classifier that classifies an object x according to the maximum rule for the vector of class supports \([\gamma _1(x), \gamma _2(x), \ldots , \gamma _M(x)]\), which are observed values of the random variables (rvs) \([\varDelta _1(x), \varDelta _2(x), \ldots , \varDelta _M(x)]\). The probability distribution of these rvs is chosen in such a way that the RRC acts, on average, as the modeled base classifier, which holds when the following constraints are fulfilled:

$$\varDelta _j(x) \ge 0, \quad \sum _{j \in \mathcal {M}} \varDelta _j(x) = 1, \quad E[\varDelta _j(x)] = d_j(x), \quad j \in \mathcal {M},$$
where E is the expected value operator. Since the RRC performs classification in a stochastic manner, it is possible to calculate the class-dependent probabilities of correct classification \(P_c(j|x)\) and misclassification \(P_e(j|x)\), and furthermore to consider them equivalent to those of the modeled base classifier:

$$P(\psi (x)=j\,|\,j) \approx P_c(j|x) = P(\psi ^{RRC}(x)=j\,|\,j).$$
It should be noted that (7) is a heuristic approach to estimating the base classifier's probabilities of correct classification (misclassification). For the BMC training phase it is assumed that a validation set containing pairs of feature vectors and their corresponding class labels is available [15]. This set can be obtained by random subsampling of the training set at a fixed ratio or, alternatively, through stacked generalization [18]:

$$\mathcal {V} = \{(x_1, j_1), (x_2, j_2), \ldots , (x_{N_V}, j_{N_V})\},$$
\(\mathcal {V}\) is used to calculate \(P(\psi (x)=k|i) \approx P(\psi ^{RRC}(x)=k|i)\), the probability that the base classifier assigns an object x belonging to class i to class k. These values are known only at the discrete points from \(\mathcal {V}\), so to enable dynamic calculation at any new object x from the feature space we need a neighborhood function describing how the probabilities at the validation points affect the new x. In this study, the Gaussian potential function was used:

$$P(\psi (x)=k\,|\,i) \approx \frac{\sum _{x_v \in \mathcal {V}_i} P(\psi ^{RRC}(x_v)=k)\, \exp (-\alpha \Vert x - x_v\Vert ^2)}{\sum _{x_v \in \mathcal {V}_i} \exp (-\alpha \Vert x - x_v\Vert ^2)},$$

in which \(\mathcal {V}_i\) denotes the validation objects from class i,
where the \(\alpha \) value in Eq. (9) is a scaling factor and should be adjusted independently for each classification problem. In this article, instead of manual selection, \(\alpha \) is selected during the preparation step of BMC using the training and validation sets: for each object from \(\mathcal {V}\) we calculate \(P_e(j|x)\) and \(P_c(j|x)\), then we randomly select \(40\%\) of the objects from the training set with stratification and use this new set to check for which \(\alpha \) BMC obtains the highest classification rate. For all experiments, \(\alpha \) was chosen from the set \(\{1, 2, 4, 6, 8\}\).
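Putting the pieces together, the following sketch (our own simplification, with illustrative names) mimics the BMC decision scheme: the class-dependent probabilities \(P(\psi (x)=k|i)\) are estimated at a new point x by Gaussian-potential weighting over the validation set, and the class maximizing the resulting posterior is returned. Crisp 0/1 base-classifier outcomes on \(\mathcal {V}\) are used here in place of the paper's RRC-based probabilities:

```python
import numpy as np

def bmc_posterior(x, k, V_X, V_y, V_pred, priors, alpha=2.0):
    """Posterior p(i | psi(x)=k) estimated from a validation set.
    V_X: validation feature vectors, V_y: their true labels,
    V_pred: base-classifier decisions psi(x_v) on the validation objects."""
    classes = np.unique(V_y)
    # Gaussian potential weights of validation points relative to x
    w = np.exp(-alpha * np.sum((V_X - x) ** 2, axis=1))
    p_k_given_i = np.empty(len(classes))
    for idx, i in enumerate(classes):
        m = V_y == i
        # weighted frequency of psi(x_v)=k among validation objects of class i
        p_k_given_i[idx] = np.sum(w[m] * (V_pred[m] == k)) / (np.sum(w[m]) + 1e-300)
    post = p_k_given_i * np.asarray(priors)
    return post / (post.sum() + 1e-300)

def bmc_decide(x, k, V_X, V_y, V_pred, priors, alpha=2.0):
    """Maximum a posteriori decision of the metaclassifier."""
    classes = np.unique(V_y)
    return classes[np.argmax(bmc_posterior(x, k, V_X, V_y, V_pred, priors, alpha))]
```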
3 Experiments
The goal of this study was twofold. Firstly, we wanted to check how BMC works on synthetically generated data coming from a 1D normal distribution described by (\(\mu , \sigma \)). This is a very simplified setting, but it allows us to calculate the optimal Bayes decision boundary and the Bayes probability of misclassification. In consequence, we can determine the empirical probability of error of BMC in relation to this probability. In the second experiment the BMC model was applied to 22 benchmark data sets taken from the UCI Machine Learning Repository [3], the Ludmila Kuncheva Repository [9] and Keel [1], for 8 classifiers with different decision paradigms. A brief description of the data sets used is given in Table 1.
3.1 Experimental Setup
In the first experiment three synthetic two-class datasets were generated from normal distributions \(N(\mu , \sigma )\) using different (\(\mu , \sigma \)) (see Fig. 1). For each generated set it was confirmed (at \(p\ll 0.01\)) that no sample differed statistically significantly from a normal distribution, using the Kolmogorov-Smirnov, Lilliefors and Shapiro-Wilk tests. For the simulations, 50k and 10k objects were generated per class for the testing and validation sets, respectively, and the a priori class probabilities were equal. Each simulation was repeated 10 times and the results were averaged. Since we deal with scalar values, we use a linear classifier (\(\psi _l\)) whose decision boundary location is determined by a point \(x=p\), so no training procedure is needed. Furthermore, using \(\mu _{1, 2}, \sigma _{1, 2}\) the optimal Bayes decision boundary can be calculated for each model and set as \(p_{optimal}\).
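For reference, under equal priors and equal class variances the optimal boundary and the Bayes error of the two-Gaussian model have closed forms; the short sketch below (our own, with illustrative parameter values rather than the paper's exact \(\mu , \sigma \)) computes both:

```python
import math

def bayes_boundary_equal_var(mu1, mu2):
    # Equal priors and equal variances: the densities intersect at the
    # midpoint of the class means, which is the optimal Bayes boundary.
    return 0.5 * (mu1 + mu2)

def bayes_error_equal_var(mu1, mu2, sigma):
    # Each class puts Phi(-|mu2 - mu1| / (2*sigma)) of its mass on the
    # wrong side of the midpoint; with equal priors that is the Bayes error.
    z = abs(mu2 - mu1) / (2.0 * sigma)
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # = Phi(-z)
```

For example, for N(0, 1) versus N(2, 1) the boundary is at 1 and the Bayes error is about 0.159.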
For the second experiment the following base classifiers were used:
- qdc - quadratic classifier based on normal distributions with the same covariance matrix for each class,
- nmc - nearest mean classifier,
- 6-nn - 6 nearest neighbors,
- parzen - Parzen classifier,
- dtc - decision tree with the information gain splitting criterion,
- bpxn-1 - feed-forward back-propagation neural network with 1 hidden layer and the number of learning epochs set to 120,
- bpxn-2 - feed-forward back-propagation neural network with 2 hidden layers and the number of learning epochs set to 120,
- svm - support vector machine with a 3rd order polynomial kernel.
For each data set in the second experiment, feature vectors were normalized to zero mean and unit standard deviation. Simulations were conducted according to the \(5\times 2\) cross-validation method for extracting training and testing sets from each data set [2]. For comparison purposes the following approach was applied: as a reference, each base classifier was trained using the whole training set, while for BMC preparation \(30\%\) of the objects were selected at random from the available training set to form the validation set and the remaining objects were used for fitting the base classifier. All experiments were conducted in MATLAB using the PRTools 5.0 package.
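The protocol above can be sketched as an index generator (our own helper with illustrative names): 5 replications of 2-fold CV, and inside each training fold a 30% validation subset \(\mathcal {V}\) is carved out for BMC while the remainder fits the base classifier (stratification is omitted here for brevity):

```python
import numpy as np

def five_times_two_cv_splits(n, val_ratio=0.3, seed=0):
    """Yield (fit_idx, val_idx, test_idx) index triples for 5 x 2-fold CV."""
    rng = np.random.default_rng(seed)
    for rep in range(5):
        perm = rng.permutation(n)
        half = n // 2
        # each replication uses both halves once as the test fold
        for test_idx, train_idx in ((perm[:half], perm[half:]),
                                    (perm[half:], perm[:half])):
            n_val = int(val_ratio * len(train_idx))
            # validation part feeds BMC, the rest fits the base classifier
            yield train_idx[n_val:], train_idx[:n_val], test_idx
```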
3.2 Experiment Results
Results for the first simulation scenario are presented in Table 2, where 7 classification error rates are shown for \(\psi _l\) with the decision boundary set to \(x=p_{optimal}-3, x=p_{optimal}-2, x=p_{optimal}-1, x=p_{optimal}, x=p_{optimal}+1, x=p_{optimal}+2, x=p_{optimal}+3\). After analyzing the results in Table 2 it can be seen that the upper bound for BMC improvement is the Bayes error calculated for the optimal decision boundary. A higher correction is obtained for the classifier \(\psi _l\) when its decision boundary lies farther from the optimal one. One should keep in mind that this experiment is very simple, but it demonstrates that BMC works according to the Bayes scheme. Table 3 shows results for the base classifiers versus their BMC overlays on the benchmark datasets. Classification accuracies were averaged over 5 repetitions of two-fold cross-validation. We performed pairwise and multiple-comparison statistical tests. For the former, the \(5\times 2\) CV F-test was applied to indicate which algorithm is better. In Table 3, \(+/=/-\) means that the BMC system is statistically significantly better, shows no statistically significant difference, or is statistically significantly worse, respectively. For assessing the ranks of the proposed methods over all examined benchmarks, we used the Friedman ranking test followed by the Shaffer post-hoc test to check which of the tested methods are distinctive among the nxn comparisons [7]. A level of \(p<0.05\) was considered statistically significant. According to the Friedman ranking test, which checks the overall performance on all available datasets, the BMC proposal was statistically significantly better than the base classifier in 5 out of 8 cases. Major improvement can be observed for problems 1, 5, 8, 9, 13, 14, 15, 17 and 18. This indicates that BMC works better with balanced datasets, while for imbalanced problems the use of a priori probabilities in the BMC decision scheme causes it to prefer the majority class.
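The pairwise test used here is Alpaydin's combined \(5\times 2\) cv F-test [2]. A minimal sketch (our own) of the statistic, which is compared against the F distribution with (10, 5) degrees of freedom:

```python
import numpy as np

def five_two_cv_f(diffs):
    """diffs: 5x2 array of per-fold accuracy differences between two
    classifiers (5 replications of 2-fold CV). Returns Alpaydin's combined
    5x2cv F statistic, approximately F(10, 5)-distributed under the null
    hypothesis of equal performance."""
    diffs = np.asarray(diffs, dtype=float)
    mean = diffs.mean(axis=1, keepdims=True)      # per-replication mean difference
    s2 = ((diffs - mean) ** 2).sum(axis=1)        # per-replication variance estimate
    return float((diffs ** 2).sum() / (2.0 * s2.sum()))
```

A p-value can then be read off the F(10, 5) distribution; large values of the statistic indicate a significant difference between the two classifiers.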
For imbalanced problems such as 12, 21 and 22, BMC does not improve classifier accuracy. This can be explained by the fact that random sampling from the training set is used to create the validation set; in this case \(\mathcal {V}\) contains mainly examples from the majority class. Another important conclusion coming from this experiment is presented in Fig. 2, which shows a box plot for the exemplary laryngeal database for selected base classifiers versus the BMC algorithm. It indicates that, apart from improving classification performance, BMC also decreases the variance between subsequent cross-validation sets.
4 Conclusions
In this paper a detailed construction of the BMC algorithm was presented and validated. The first experiment confirmed that the upper bound of BMC improvement over a base classifier is the Bayes error. This is not very surprising, since BMC works according to the Bayes scheme, maximizing a posteriori probabilities for the decision of a base classifier. From the analysis of the three synthetic datasets it is straightforward that BMC performance directly depends on the classification quality of the base classifier \(\psi \). The further experiment on benchmark datasets shows that BMC can significantly correct base classifier decisions, especially in cases where the base classifier is not optimal. The overall performance according to the Friedman ranking test indicates that the BMC proposal was statistically significantly better for 5 out of 8 different classifiers. Additionally, when analyzing results from each cross-validation phase, it is clearly visible that BMC decreases classification variance compared to its base counterpart. On the other hand, the main drawback of BMC is proper validation set generation. BMC works better with balanced datasets, while for imbalanced problems the use of a priori probabilities in the BMC design causes its decision boundary to move towards the majority class. To sum up, since BMC provides a probabilistic interpretation of a base classifier's correct and incorrect responses, this method can be directly used in sequential classification problems or in MC systems, especially during classifier fusion. As future work, we would like to apply BMC to ensemble construction at both the crisp and continuous levels of classifier response.
References
Alcala-Fdez, J., Fernandez, A., Luengo, J., Derrac, J., Garcia, S., Sanchez, L., Herrera, F.: Keel data-mining software tool: data set repository. J. Multiple-Valued Logic Soft Comput. 17, 255–287 (2011)
Alpaydin, E.: Combined \(5 \,\times \, 2\) cv F test for comparing supervised classification learning algorithms. J. Neural Comput. 11, 1885–1892 (1999)
Asuncion, A., Newman, D.: UCI machine learning repository (2007). http://www.ics.uci.edu/mlearn/MLRepository.html
Duda, R., Hart, P., Stork, D.: Pattern Classification, 2nd edn. Wiley-Interscience, New York (2001)
Freund, Y.: Boosting a weak learning algorithm by majority. Inf. Comput. 121(2), 256–285 (1995)
Fukunaga, K.: Introduction to Statistical Pattern Recognition, 2nd edn. Academic Press Professional Inc., San Diego (1990)
Garcia, S., Fernandez, A., Luengo, J., Herrera, F.: Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Inf. Sci. 180, 2044–2064 (2010)
Kuncheva, L.: Combining Pattern Classifiers: Methods and Algorithms. Wiley-Interscience, New York (2004)
Kuncheva, L.: Real medical data sets repository (2004). http://pages.bangor.ac.uk/~mas00a/activities/real_data.htm
Kurzynski, M., Majak, M.: Meta-Bayes classifier with Markov model applied to the control of bioprosthetic hand. In: Czarnowski, I., Caballero, A.M., Howlett, R.J., Jain, L.C. (eds.) Intelligent Decision Technologies 2016. SIST, vol. 57, pp. 107–117. Springer, Cham (2016). doi:10.1007/978-3-319-39627-9_10
Kurzynski, M., Majak, M., Zolnierek, A.: Multiclassifier systems applied to the computer-aided sequential medical diagnosis. J. Biocybern. Biomed. Eng. 36, 619–625 (2016)
Skurichina, M., Duin, R.P.W.: Bagging, boosting and the random subspace method for linear classifiers. Pattern Anal. Appl. 5(2), 121–135 (2002)
Skurichina, M., Duin, R.P.W.: Bagging and the random subspace method for redundant feature spaces. In: Kittler, J., Roli, F. (eds.) MCS 2001. LNCS, vol. 2096, pp. 1–10. Springer, Heidelberg (2001). doi:10.1007/3-540-48219-9_1
Berger, J.O.: Statistical Decision Theory and Bayesian Analysis, 2nd edn. Springer, New York (1985)
Woloszynski, T.: Classifier competence based on probabilistic modeling (ccprmod.m) (2013)
Woloszynski, T., Kurzynski, M.: A probabilistic model of classifier competence for dynamic ensemble selection. Pattern Recogn. 44, 2656–2668 (2011)
Woloszynski, T., Kurzynski, M.: A measure of competence based on random classification for dynamic ensemble selection. Inf. Fusion 13, 207–213 (2012)
Wolpert, D.: Stacked generalization. Neural Netw. 5, 241–259 (1992)
Wozniak, M., Grana, M., Corchado, E.: A survey of multiple classifier systems as hybrid systems. Inf. Fusion 16, 3–17 (2014)
Acknowledgments
This work was supported by the statutory funds of the Department of Systems and Computer Networks, Wroclaw University of Science and Technology.
© 2018 Springer International Publishing AG
Majak, M., Kurzyński, M. (2018). On a New Method for Improving Weak Classifiers Using Bayes Metaclassifier. In: Kurzynski, M., Wozniak, M., Burduk, R. (eds) Proceedings of the 10th International Conference on Computer Recognition Systems CORES 2017. CORES 2017. Advances in Intelligent Systems and Computing, vol 578. Springer, Cham. https://doi.org/10.1007/978-3-319-59162-9_27
Print ISBN: 978-3-319-59161-2
Online ISBN: 978-3-319-59162-9