
1 Introduction

As one of the most important theorems in statistical learning, the no free lunch theorem [1] states that an algorithm may perform better than others on some problems and worse on others, which implies a form of equivalence among algorithms. Due to the natural differences among classification problems, many different classifiers have been designed. In this paper we introduce a novel variant of the Nearest Centroid (NC) classifier [2].

In our new algorithm, the distance measure is a weighted Euclidean metric whose weights are determined by the predictive power of each feature and the homogeneity of each feature within each class. The Information Value (IV) metric is chosen to represent the predictive power of features. For measuring homogeneity, the mean absolute deviation, standard deviation, variance, and coefficient of variation are used. The choice among these metrics is a parameter of our algorithm and is set during the tuning phase.

A binary classification model provides predicted classes, predicted class probabilities, or a ranking of these probabilities, and different performance measures are used for evaluation [3]. We used accuracy to compare the performance of the algorithms. For benchmarking, we used 50 different datasets and 4 different algorithms, namely Nearest Centroid (NC), Logistic Regression (LR), K-Nearest Neighbours (KNN) and Decision Tree (DT). The new algorithm outperformed NC on 43 datasets, LR on 8, KNN on 17 and DT on 15. It was the best classifier on 5 datasets with respect to accuracy.

2 Related Research

The Nearest Centroid Classifier [2] is a memory-based classifier that is simple and fast. It stores the centroid of each class and computes distances to each centroid for every new instance. The class of the nearest centroid is assigned as the predicted class of that instance; hence the algorithm is also called the Minimum Distance Classifier (MDC).

Instead of class centroids, some instances in the training dataset can be chosen to represent a class [4]. These instances are called prototypes.

The nearest shrunken centroid classifier [5] is another variant of the Nearest Centroid algorithm. It shrinks each class centroid toward the overall centroid by an amount called the threshold. After shrinking, a new instance is classified by the nearest centroid rule, but using the shrunken class centroids.

The Nearest Centroid Classifier, K-Means Clustering [6] and K-Nearest Neighbours (KNN) [7] algorithms are closely related to each other since they all rely on centroid information. KNN and its variants [8] are used for both classification and regression problems. Weighting centroid-based distances is common, and various measures have been used so far. For example, Gou et al. [9] proposed an algorithm that captures both the proximity and the geometry of the k nearest neighbours. In another recent study, Elen et al. [10] propose a new classifier that addresses noise in classification by introducing standardized variable distances.

Baesens et al. [11] compared the performance of different algorithms on several datasets, two of which are in our list, namely “Australian” and “credit-g (German Credit)”. For the Australian dataset, the accuracy of our new algorithm is the third best, after the algorithms “C4.5rules dis” and “C4.5dis”. For the German Credit dataset, the accuracy of our new algorithm (76.40) is above that of the best algorithm tested there (LDA, with a value of 74.6). Lessmann et al. [12] updated this research by testing various classification algorithms on 8 datasets.

The novelty of our algorithm lies in using both the Information Value and homogeneity metrics to weight the distance calculation.

3 Methods

3.1 Information Value

The Information Value (IV) metric is used to determine the predictive power of each feature. Its calculation is based on “Weight of Evidence” values [13]. The Weight of Evidence (WOE) of a feature value is defined as the natural logarithm of the ratio of the share of one class within that value to the share of the other class, as shown in Eq. 1.

$$ WoE\left( X_i = X_{ij} \right) = \ln \left( \frac{\text{share of responses of that category among all responses}}{\text{share of nonresponses of that category among all nonresponses}} \right) $$
(1)

Equation 2 gives the computation of the Information Value from the Weight of Evidence values. The differences between the percentage shares of the classes are used as weights in the summation.

$$ IV\left( X \right) = \sum_{j = 1}^{V} \left( \text{Distribution of class 1} - \text{Distribution of class 0} \right)_j \cdot WoE_j $$
(2)

The Information Value can be computed only for categorical variables. For that reason, we split continuous variables into 10 equal-sized bins before the calculation.

Both the Information Value and Weight of Evidence metrics are widely used in credit-scoring analysis in the banking sector.
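The WOE and IV computations of Eqs. 1 and 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation; the small `eps` constant guarding empty categories is our own assumption.

```python
import numpy as np

def woe_iv(feature, target, eps=1e-9):
    """Weight of Evidence per category (Eq. 1) and Information Value (Eq. 2)
    of a categorical feature. `target` holds 0/1 labels; class 1 plays the
    role of "responses" and class 0 of "nonresponses"."""
    feature = np.asarray(feature)
    target = np.asarray(target)
    n_pos = max(int((target == 1).sum()), 1)
    n_neg = max(int((target == 0).sum()), 1)
    woe, iv = {}, 0.0
    for cat in np.unique(feature):
        mask = feature == cat
        p = (target[mask] == 1).sum() / n_pos  # share of responses
        q = (target[mask] == 0).sum() / n_neg  # share of nonresponses
        w = np.log((p + eps) / (q + eps))      # eps guards empty cells
        woe[cat] = w
        iv += (p - q) * w                      # Eq. 2 summand
    return woe, iv
```

For a continuous variable one would first bin it (e.g. with `np.quantile` into 10 equal-sized bins, as described above) and then apply the same function to the bin labels.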

3.2 Homogeneity Metrics

For each feature, the following sparsity metrics are computed to represent the inverse of the homogeneity within each class:

  1. Mean absolute deviation

  2. Standard deviation

  3. Variance

  4. Coefficient of variation

All of them are used in the training phase to fit the classifier to the dataset. As the last step, the one yielding the highest accuracy is marked as the selected homogeneity metric. During tuning, in addition to these 4 choices, using no homogeneity metric at all is also evaluated and is selected when its accuracy is highest. Therefore, there are 5 alternatives for homogeneity.
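The four candidate metrics above can be computed per class and per feature as in the sketch below (our own illustration; the coefficient-of-variation guard for zero means is an assumption).

```python
import numpy as np

def sparsity_metrics(X, y):
    """Per-class, per-feature values of the four candidate sparsity
    (inverse-homogeneity) metrics. X: (n, d) array; y: class labels."""
    metrics = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mean = Xc.mean(axis=0)
        mad = np.abs(Xc - mean).mean(axis=0)   # mean absolute deviation
        std = Xc.std(axis=0)                   # standard deviation
        var = Xc.var(axis=0)                   # variance
        cv = np.divide(std, np.abs(mean),      # coefficient of variation,
                       out=np.zeros_like(std), # zero when the mean is zero
                       where=np.abs(mean) > 0)
        metrics[c] = {"mad": mad, "std": std, "var": var, "cv": cv}
    return metrics
```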

3.3 The Algorithm

In the training phase, first the centroid of each class, the Information Value of each feature and the homogeneity value of each feature within each class are computed. Then, for the given dataset, the best homogeneity metric is determined by running the scoring on the training dataset and comparing accuracies. Figure 1(a) depicts the flow of the training phase.

In the scoring phase, for each new instance, the Euclidean distance to each class centroid is calculated, where the weight of each feature is its Information Value divided by the selected sparsity metric. The class with the minimum weighted distance becomes the predicted class for that instance. Figure 1(b) depicts the flow of the scoring phase.

Fig. 1. Algorithm flowcharts

The algorithm is memory-based, but it does not require storing the training instances in memory; only class summaries (i.e., centroid vectors) need to be stored. We used only binary classification datasets, but the algorithm could also be applied to multiclass problems.
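The fit/predict flow described above can be condensed into a minimal sketch. This is our own illustration under simplifying assumptions: the per-feature IV vector is passed in precomputed, and the standard deviation is hard-coded as the selected sparsity metric (the tuning loop over the five alternatives is omitted).

```python
import numpy as np

class IVWeightedNearestCentroid:
    """Sketch of the proposed classifier: nearest centroid under a weighted
    Euclidean distance with weight_jc = IV_j / sparsity_jc per feature j
    and class c."""

    def __init__(self, iv, eps=1e-9):
        self.iv = np.asarray(iv, dtype=float)  # per-feature Information Values
        self.eps = eps                         # guards zero-spread features

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # standard deviation stands in for the tuned sparsity metric
        self.sparsity_ = np.array([X[y == c].std(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        w = self.iv / (self.sparsity_ + self.eps)  # (n_classes, d) weights
        preds = []
        for x in np.atleast_2d(X):
            # weighted squared Euclidean distance to each class centroid
            d = (w * (x - self.centroids_) ** 2).sum(axis=1)
            preds.append(self.classes_[np.argmin(d)])
        return np.array(preds)
```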

4 Experimentation and Results

4.1 Setup

All experiments are performed on a PC equipped with a 4-core 11th Gen Intel i7 CPU at 2.80 GHz and 32 GB RAM, running Microsoft Windows 11 Pro, Anaconda 3, Conda 22.9.0, Jupyter Notebook 6.4.12 and Python 3.9.13.

4.2 Datasets

We used the OpenML [14] machine learning repository, a public online platform built for scientific collaboration, and its Python API [15] to access datasets in the repository. We downloaded 45 binary classification datasets using this API. We also generated 5 synthetic classification datasets via the “make_classification” function of the Scikit-Learn library [16] and named them with the prefix “random”. Table 1 shows the datasets used in the experiments.

The second column (“data_id”) is the unique identifier in the OpenML repository. Table 1 also lists the number of instances and features in each dataset. Since pre-processing changes the number of features, the final feature counts are given in the last column of the table.

Table 1. List of datasets
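The synthetic datasets could be generated along the following lines; the exact `make_classification` parameters used for the “random” datasets are not stated in the paper, so the values below are illustrative assumptions.

```python
from sklearn.datasets import make_classification

# Illustrative generation of one synthetic binary classification dataset;
# n_samples, n_features and the informative/redundant split are assumptions.
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=10,
    n_redundant=5,
    n_classes=2,
    random_state=42,
)
```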

4.3 Pre-Processing

The following pre-processing steps are applied to all datasets:

  • Splitting: Each dataset is randomly divided into training and test datasets, with the share of the test dataset set to 25%.

  • Null Value Imputation: Null values were replaced by zero values.

  • Winsorization: For numeric features of the training dataset, 5% cut-off values are calculated from each end. Values beyond these limits are replaced by the cut-off values. Winsorization is applied only when the number of unique values of the feature exceeds 60.

  • Min-Max Scaling: All numeric feature values are proportionally scaled into [0,1] range.

  • One-hot encoding: For each distinct value of a categoric feature, a new flag variable is created.
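The numeric steps above (null imputation, winsorization, min-max scaling) can be sketched for a single feature as follows. This is our own illustration; in particular, learning the cut-offs and scaling bounds on the training split only is our reading of the procedure.

```python
import numpy as np

def preprocess_numeric(train_col, test_col, p=5.0, winsor_threshold=60):
    """Apply null imputation, winsorization and min-max scaling to one
    numeric feature; all statistics come from the training split."""
    train = np.nan_to_num(np.asarray(train_col, dtype=float))  # nulls -> 0
    test = np.nan_to_num(np.asarray(test_col, dtype=float))
    if len(np.unique(train)) > winsor_threshold:
        lo, hi = np.percentile(train, [p, 100 - p])  # 5% cut-off per end
        train = np.clip(train, lo, hi)
        test = np.clip(test, lo, hi)
    mn, mx = train.min(), train.max()
    scale = (mx - mn) or 1.0  # avoid division by zero for constant features
    return (train - mn) / scale, (test - mn) / scale
```

Note that, with training-derived bounds, scaled test values may fall outside [0, 1]; clipping them as well would be an alternative design choice.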

4.4 Tuning

For a fair comparison, the hyperparameters of the Logistic Regression, KNN and Decision Tree algorithms are tuned with the options shown in Table 2. For each dataset, tuning is performed with an exhaustive search over these parameters via the “GridSearchCV” function of the Scikit-learn library [16].

Table 2. Parameters used at the tuning
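The grid search can be sketched as below for KNN; the grid shown is an assumption for illustration, not the actual parameter set of Table 2.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Illustrative data and parameter grid (the real grids are those in Table 2).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
grid = {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]}

# Exhaustive search over the grid, scored by accuracy as in the paper.
search = GridSearchCV(KNeighborsClassifier(), grid, scoring="accuracy", cv=5)
search.fit(X, y)
best_knn = search.best_estimator_
```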

4.5 Results

Table 3 shows the accuracy values of the 5 algorithms on the test datasets. The proposed algorithm outperforms all four other algorithms on 5 datasets and ties for first place on another 3. It has a better accuracy score than Nearest Centroid on 43 datasets, and it is better than Logistic Regression, KNN and Decision Tree on 8, 17 and 15 datasets respectively.

We calculated the correlation coefficients between the accuracy values of our algorithm and those of the others. Our algorithm's accuracies over the test datasets correlate with those of Nearest Centroid, Logistic Regression, KNN and Decision Trees by 0.28, 0.25, 0.30 and 0.35 respectively.

Table 3. Accuracy of algorithms at test datasets

5 Discussion and Conclusion

Weight of Evidence and Information Value metrics are commonly used in banking to predict credit defaults, and our experiments include three datasets from the credit-risk domain, namely “Australian”, “credit-approval” and “credit-g (German Credit)”. On all three, the new algorithm was better than all others. This could be further investigated using additional datasets from the credit-risk domain.

The new algorithm is clearly superior to the Nearest Centroid algorithm and comparable to the other classic algorithms. It could be improved by considering other characteristics, such as class sizes.

Various alternatives may be considered to improve performance further. For example, instead of the Information Value, another measure of predictive power, such as variable importance, could be used. Similarly, in the distance calculation, the Euclidean formula could be replaced by other distance metrics, such as the Manhattan distance.

Over the 50 datasets, the new algorithm outperforms the benchmark algorithms on 5 datasets and ties for first place on another 3. The new algorithm is therefore comparable to well-known algorithms and worth enhancing further.

6 Declaration of Competing Interest

The authors declare that no conflict of interest was identified in this study.