Abstract
The aim of this paper is to introduce a novel classification algorithm based on the distance to class centroids with a weighted Euclidean distance metric. Features are weighted by their predictive power and in-class homogeneity. Predictive power is measured with the Information Value metric; for in-class homogeneity, several different measures are used. The algorithm is memory based, but only the centroid information needs to be stored. Experiments are carried out on 45 benchmark datasets and 5 randomly generated datasets, and the results are compared against the Nearest Centroid, Logistic Regression, K-Nearest Neighbours and Decision Tree algorithms. The parameters of the new algorithm and of these traditional classification algorithms are tuned before comparison. The results are promising and have the potential to trigger further research.
1 Introduction
As one of the most important theorems in statistical learning, the no free lunch theorem [1] states that an algorithm may perform better than others on some problems and worse on others, which implies a certain equivalence among algorithms. Owing to the natural differences among classification problems, many different classifiers have been designed. In this paper, we introduce a novel variant of the Nearest Centroid (NC) classifier [2].
In our new algorithm, the distance measure is a weighted Euclidean metric whose weights are determined by the predictive power of each feature and the homogeneity of each feature within each class. The Information Value (IV) metric is used to measure predictive power. For homogeneity, the mean absolute deviation, standard deviation, variance and coefficient of variation metrics are used. The choice among these metrics is a parameter of our algorithm and is set during the tuning phase.
A binary classification model provides predicted classes, predicted class probabilities, or rankings of those probabilities, and different performance measures are used for evaluation [3]. We used the accuracy measure to compare the performance of the algorithms. For benchmarking, we used 50 different datasets and 4 different algorithms, namely Nearest Centroid (NC), Logistic Regression (LR), K-Nearest Neighbours (KNN) and Decision Tree (DT). The new algorithm outperformed NC on 43 datasets, LR on 8 datasets, KNN on 17 datasets and DT on 15 datasets, and it was the best classifier on 5 datasets with respect to the accuracy measure.
2 Related Research
The Nearest Centroid classifier [2] is a simple and fast memory-based classifier. It stores the centroid of each class and computes the distance of each new instance to these centroids. The class of the nearest centroid is assigned as the predicted class of that instance; for this reason, the algorithm is also called the Minimum Distance Classifier (MDC).
Instead of class centroids, some instances in the training dataset can be chosen to represent each class [4]. These instances are called prototypes.
The nearest shrunken centroid classifier [5] is another variant of the Nearest Centroid algorithm. It shrinks each class centroid toward the overall centroid of all classes by an amount called the threshold. A new instance is then classified by the nearest centroid rule, but using the shrunken class centroids.
The Nearest Centroid classifier, K-Means clustering [6] and K-Nearest Neighbours (KNN) [7] algorithms are closely related to one another, since all of them rely on centroid or neighbourhood information. KNN and its variants [8] are used for both classification and regression problems. Weighting centroid-based distances is common, and various measures have been proposed. For example, Gou et al. [9] proposed an algorithm that captures both the proximity and the geometry of the k-nearest neighbours. In another recent study, Elen et al. [10] propose a new classifier that addresses noise in classification by introducing standardized variable distances.
Baesens et al. [11] compared the performance of different algorithms over several datasets, two of which are in our list, namely “Australian” and “credit-g (German Credit)”. For the Australian dataset, the accuracy of our new algorithm is the third best, after the “C4.5rules dis” and “C4.5dis” algorithms. For the German Credit dataset, the accuracy of our new algorithm (76.40) is above that of the best algorithm they tested (LDA, with a value of 74.6). Lessmann et al. [12] updated this research by testing various classification algorithms over 8 datasets.
The novelty of our algorithm lies in using both the Information Value and a homogeneity metric to weight the distance calculation.
3 Methods
3.1 Information Value
The Information Value (IV) metric is used to determine the predictive power of each feature. Its calculation is based on “Weight of Evidence” values [13]. The Weight of Evidence (WOE) of a feature value is defined as the natural logarithm of the ratio of the share of one class for a given value to the share of the other class, as shown in Eq. 1.
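The equation image did not survive extraction; a standard form of Eq. 1, consistent with the description above and with the credit-scoring literature [13], is:

```latex
\mathrm{WOE}_i \;=\; \ln\!\left( \frac{n_{1,i}/n_1}{n_{0,i}/n_0} \right) \tag{1}
```

where \(n_{c,i}\) is the number of class-\(c\) instances taking value (or bin) \(i\), and \(n_c\) is the total number of class-\(c\) instances.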
Equation 2 gives the computation of the Information Value from the Weight of Evidence values. The differences of the percentage shares of the classes are used as weights in the summation.
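As with Eq. 1, the original equation image is missing; the standard definition matching the description above is, with \(n_{c,i}\) the number of class-\(c\) instances in value (or bin) \(i\) and \(n_c\) the class total:

```latex
\mathrm{IV} \;=\; \sum_i \left( \frac{n_{1,i}}{n_1} - \frac{n_{0,i}}{n_0} \right) \mathrm{WOE}_i \tag{2}
```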
The Information Value can be calculated only for categorical variables. For that reason, we split continuous variables into 10 equal-sized bins before the calculation.
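The binning and IV computation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name, the quantile-based binning via `pandas.qcut`, and the small epsilon guarding against empty cells are our assumptions.

```python
import numpy as np
import pandas as pd

def information_value(feature, target, n_bins=10):
    """Compute the Information Value of one feature against a binary target.

    Continuous features are first split into 10 equal-sized (quantile) bins,
    mirroring the discretization described in the text.
    """
    df = pd.DataFrame({"x": feature, "y": target})
    if df["x"].nunique() > n_bins:
        df["x"] = pd.qcut(df["x"], q=n_bins, duplicates="drop")
    n_pos = (df["y"] == 1).sum()
    n_neg = (df["y"] == 0).sum()
    iv = 0.0
    for _, grp in df.groupby("x", observed=True):
        # Share of each class falling into this bin; the epsilon is an
        # illustrative guard against log(0) for empty cells.
        p_pos = max((grp["y"] == 1).sum() / n_pos, 1e-6)
        p_neg = max((grp["y"] == 0).sum() / n_neg, 1e-6)
        woe = np.log(p_pos / p_neg)       # Eq. 1
        iv += (p_pos - p_neg) * woe       # Eq. 2
    return iv
```

A feature that separates the classes well yields a large IV, while an uninformative feature yields an IV near zero.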
Both the Information Value and Weight of Evidence metrics are widely used in credit scoring in the banking sector.
3.2 Homogeneity Metrics
For each feature, the following sparsity metrics are computed to represent the inverse of homogeneity within each class:
1. Mean absolute deviation
2. Standard deviation
3. Variance
4. Coefficient of variation
All of them are used in the training phase to fit the classifier to the dataset, and the one yielding the highest accuracy is selected as the homogeneity metric. In the tuning phase, in addition to these 4 choices, using no homogeneity metric at all is also evaluated and is selected when its accuracy is the highest. Therefore, there are 5 alternatives for homogeneity.
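The four candidate metrics can be computed per feature and per class as below. This is a minimal sketch with our own function and key names; the paper does not prescribe an implementation, and the handling of a zero mean in the coefficient of variation is our assumption.

```python
import numpy as np

def sparsity_metrics(values):
    """Return the four candidate sparsity (inverse-homogeneity) metrics
    for one feature's values within one class."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    mad = np.abs(values - mean).mean()        # mean absolute deviation
    std = values.std()                        # standard deviation
    var = values.var()                        # variance
    cv = std / mean if mean != 0 else np.inf  # coefficient of variation
    return {"mad": mad, "std": std, "var": var, "cv": cv}
```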
3.3 The Algorithm
In the training phase, the centroids of each class, the Information Values of each feature and the homogeneity values of each feature within each class are computed first. Then, for the given dataset, the best homogeneity metric is determined by scoring the training dataset with the accuracy metric. Figure 1(a) depicts the flow of the training phase.
In the scoring phase, for each new instance, the weighted Euclidean distance to each class centroid is calculated, where the weight of each feature is its Information Value divided by the selected sparsity metric. The class with the minimum weighted distance becomes the predicted class of that instance. Figure 1(b) depicts the flow of the scoring phase.
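The scoring rule can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the data-structure choices are ours, and placing the weights inside the square root follows one common convention for weighted Euclidean distance, which may differ from the authors' exact formulation.

```python
import numpy as np

def predict(x, centroids, iv, sparsity):
    """Weighted nearest-centroid scoring of one instance.

    centroids: dict mapping class label -> centroid vector
    iv:        per-feature Information Value vector
    sparsity:  dict mapping class label -> per-feature sparsity values
               (the selected homogeneity metric for that class)
    The weight of feature j for class c is IV_j / sparsity_{c,j}.
    """
    best_class, best_dist = None, np.inf
    for c, mu in centroids.items():
        w = iv / sparsity[c]                    # per-feature weights
        d = np.sqrt(np.sum(w * (x - mu) ** 2))  # weighted Euclidean distance
        if d < best_dist:
            best_class, best_dist = c, d
    return best_class
```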
The algorithm is memory based, but it does not require storing the training instances in memory; only the class summaries (i.e., centroid vectors) need to be stored. We used only binary classification datasets, but the algorithm can also be applied to multiclass problems.
4 Experimentation and Results
4.1 Setup
All experiments were performed on a PC with a 4-core 11th Gen Intel i7 CPU at 2.80 GHz and 32 GB RAM, running Microsoft Windows 11 Pro, Anaconda 3, Conda 22.9.0, Jupyter Notebook 6.4.12 and Python 3.9.13.
4.2 Datasets
We used the OpenML [14] machine learning repository, a public online platform built for scientific collaboration, and its Python API [15] to access datasets in the repository. We downloaded 45 binary classification datasets via this API. We also generated 5 synthetic classification datasets via the “make_classification” function of the Scikit-learn library [16] and named them with the prefix “random”. Table 1 shows the datasets used in the experiments.
The second column (“data_id”) is the unique identifier in the OpenML repository. Table 1 also lists the number of instances and features of each dataset. Since pre-processing changes the number of features, the final number of features is given in the last column of the table.
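Synthetic dataset generation of the kind described above might look as follows. The specific parameter values are illustrative only; the text does not report the settings used for the five “random” datasets.

```python
from sklearn.datasets import make_classification

# Generate one synthetic binary classification dataset. These parameter
# values are our assumptions, chosen only to demonstrate the call.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    n_redundant=2, n_classes=2, random_state=42,
)
```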
4.3 Pre-Processing
The following pre-processing steps are applied to all datasets:
- Splitting: Each dataset is randomly divided into training and test sets, with the share of the test set fixed at 25%.
- Null Value Imputation: Null values are replaced by zero.
- Winsorization: For numeric features of the training dataset, 5% cut-off values are calculated from each end, and values beyond these limits are replaced by the cut-off values. Winsorization is applied only when the number of unique values of the feature is greater than 60.
- Min-Max Scaling: All numeric feature values are proportionally scaled into the [0, 1] range.
- One-hot Encoding: For each distinct value of a categorical feature, a new flag variable is created.
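The steps above can be sketched in a single helper. This is a minimal illustration, not the authors' pipeline: the function name is ours, only the training split is transformed (in a full pipeline the test split would reuse the fitted cut-offs and min/max values), and the random seed is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def preprocess(df, target_col):
    """Sketch of the pre-processing steps listed in the text."""
    # Splitting: 25% held out as the test set.
    train, test = train_test_split(df, test_size=0.25, random_state=0)
    # Null value imputation with zero.
    train = train.fillna(0)
    for col in train.select_dtypes(include=np.number).columns:
        if col == target_col:
            continue
        # Winsorization at the 5% cut-offs, only for high-cardinality features.
        if train[col].nunique() > 60:
            lo, hi = train[col].quantile([0.05, 0.95])
            train[col] = train[col].clip(lo, hi)
        # Min-max scaling into [0, 1].
        col_min, col_max = train[col].min(), train[col].max()
        if col_max > col_min:
            train[col] = (train[col] - col_min) / (col_max - col_min)
    # One-hot encoding of categorical features.
    train = pd.get_dummies(train)
    # The test split is returned untransformed here for brevity.
    return train, test
```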
4.4 Tuning
For a fair comparison, the hyperparameters of the Logistic Regression, KNN and Decision Tree algorithms are tuned with the options shown in Table 2. For each dataset, the tuning is performed as an exhaustive search over these parameters via the “GridSearchCV” function of the Scikit-learn library [16].
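An exhaustive search of this kind might look as follows for KNN. The parameter grid below is illustrative; the actual grids are those listed in Table 2, which are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# A small stand-in dataset; in the experiments each OpenML dataset
# would take its place.
X, y = make_classification(n_samples=200, random_state=0)

# Hypothetical grid, for illustration only.
grid = {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), grid, scoring="accuracy", cv=5)
search.fit(X, y)
```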
4.5 Results
Table 3 shows the accuracy values of the 5 algorithms on the test datasets. The proposed algorithm outperforms all four other algorithms on 5 datasets and shares the first place on another 3. It has a better accuracy score than Nearest Centroid on 43 datasets, and it is better than the Logistic Regression, KNN and Decision Tree algorithms on 8, 17 and 15 datasets, respectively.
We also calculated the correlation coefficients between the accuracy values of our algorithm and those of the others. Over the test datasets, our algorithm's accuracies correlate with those of Nearest Centroid, Logistic Regression, KNN and Decision Tree by 0.28, 0.25, 0.30 and 0.35, respectively.
5 Discussion and Conclusion
The Weight of Evidence and Information Value metrics are commonly used in banking to predict credit defaults, and our experiments included three datasets from the credit risk domain, namely “Australian”, “credit-approval” and “credit-g (German Credit)”. On all three, the new algorithm was better than all the others. This could be investigated further using additional datasets from the credit risk domain.
The new algorithm is clearly superior to the Nearest Centroid algorithm and has comparable performance to the other classic algorithms. It could be improved by considering other characteristics, such as the sizes of the classes.
To improve the performance of the algorithm, various alternatives may be considered. For example, instead of the Information Value, another measure of predictive power, such as variable importance, could be used. Similarly, in the distance calculation the Euclidean formula could be replaced by other distance metrics, such as Manhattan.
Over the 50 datasets, the new algorithm outperforms the benchmark algorithms on 5 datasets and shares the first place on another 3. The new algorithm is therefore comparable to well-known algorithms and worth enhancing further.
6 Declaration of Competing Interest
The authors declare that they have no conflict of interest.
References
Wolpert, D.H.: The supervised learning no-free-lunch theorems. Soft Comput. Ind. 25–42 (2002)
Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, vol. 2, p. 670. Springer, New York (2009)
Shmueli, G.: To explain or to predict? Stat. Sci. 25(3), 289–310 (2010)
Kuncheva, L.I.: Prototype classifiers and the big fish: the case of prototype (instance) selection. IEEE Syst. Man Cybern. Mag. 6(2), 49–56 (2020)
Tibshirani, R., Hastie, T., Narasimhan, B., Chu, G.: Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl. Acad. Sci. 99(10), 6567–6572 (2002)
Hartigan, J.A., Wong, M.A.: Algorithm AS 136: a k-means clustering algorithm. J. Roy. Stat. Soc. Ser. C (Appl. Stat.) 28(1), 100–108 (1979)
Cover, T., Hart, P.: Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 13(1), 21–27 (1967)
Alpaydin, E.: Voting over multiple condensed nearest neighbors. In: Aha, D.W. (ed.) Lazy Learning, pp. 115–132. Springer, Dordrecht (1997). https://doi.org/10.1007/978-94-017-2053-3_4
Gou, J., et al.: A representation coefficient-based k-nearest centroid neighbor classifier. Expert Syst. Appl. 194, 116529 (2022)
Elen, A., Avuçlu, E.: Standardized Variable Distances: a distance-based machine learning method. Appl. Soft Comput. 98, 106855 (2021)
Baesens, B., Van Gestel, T., Viaene, S., Stepanova, M., Suykens, J., Vanthienen, J.: Benchmarking state-of-the-art classification algorithms for credit scoring. J. Oper. Res. Soc. 54, 627–635 (2003)
Lessmann, S., Baesens, B., Seow, H.V., Thomas, L.C.: Benchmarking state-of-the-art classification algorithms for credit scoring: an update of research. Eur. J. Oper. Res. 247(1), 124–136 (2015)
Siddiqi, N.: Intelligent Credit Scoring: Building and Implementing Better Credit Risk Scorecards, pp. 186–197. Wiley (2017)
Vanschoren, J., Van Rijn, J.N., Bischl, B., Torgo, L.: OpenML: networked science in machine learning. ACM SIGKDD Explor. Newslett. 15(2), 49–60 (2014)
Feurer, M., et al.: OpenML-Python: an extensible Python API for OpenML. J. Mach. Learn. Res. 22(1), 4573–4577 (2021)
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Acknowledgements
The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Özçelik, M.H., Bulkan, S. (2024). Nearest Centroid Classifier Based on Information Value and Homogeneity. In: Şen, Z., Uygun, Ö., Erden, C. (eds) Advances in Intelligent Manufacturing and Service System Informatics. IMSS 2023. Lecture Notes in Mechanical Engineering. Springer, Singapore. https://doi.org/10.1007/978-981-99-6062-0_5
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-6061-3
Online ISBN: 978-981-99-6062-0