
1 Introduction

As one of the most important theorems in statistical learning, the no free lunch theorem [1] states that an algorithm may perform better than others on some problems and worse on others, which implies a form of equivalence among algorithms. Due to the natural differences among classification problems, many different classifiers have been designed. In this paper we introduce a novel variant of the Nearest Centroid (NC) classifier [2].

In our new algorithm, the distance measure is a weighted Euclidean metric whose weights are determined by the predictive power of each feature and the homogeneity of each feature within each class. The Information Value (IV) metric is chosen to represent the predictive power of features. For measuring homogeneity, the mean absolute deviation, standard deviation, variance, and coefficient of variation are used. The choice among these metrics is a parameter of our algorithm and is set during the tuning phase.

A binary classification model provides predicted classes, predicted class probabilities, or a ranking of these probabilities, and different performance measures are used for evaluation [3]. We used accuracy to compare the performance of the algorithms. For benchmarking, we used 50 different datasets and 4 different algorithms, namely Nearest Centroid (NC), Logistic Regression (LR), K-Nearest Neighbours (KNN) and Decision Tree (DT). The new algorithm outperformed NC on 43 datasets, LR on 8, KNN on 17 and DT on 15. It was the best classifier on 5 datasets with respect to accuracy.

2 Related Research

The Nearest Centroid Classifier [2] is a memory-based classifier that is simple and fast. It stores the centroid of each class and computes distances to each centroid for every new instance. The class of the nearest centroid is assigned as the predicted class of that instance; hence the algorithm is also called the Minimum Distance Classifier (MDC).

Instead of class centroids, some instances in the training dataset can be chosen to represent a class [4]. These instances are called prototypes.

The nearest shrunken centroid classifier [5] is another variant of the Nearest Centroid algorithm. It shrinks each class centroid toward the overall centroid by an amount called the threshold. After shrinking, a new instance is classified by the nearest centroid rule, but using the shrunken class centroids.

The Nearest Centroid Classifier, K-Means Clustering [6] and K-Nearest Neighbours (KNN) [7] algorithms are closely related to each other since they all rely on centroid information. KNN and its variants [8] are used for both classification and regression problems. Weighting centroid-based distances is common, and various measures have been used so far. For example, Gou et al. [9] proposed an algorithm that captures both the proximity and the geometry of the k nearest neighbours. In another recent study, Elen et al. [10] propose a new classifier that addresses noise in classification by introducing standardized variable distances.

Baesens et al. [11] compared the performance of different algorithms on several datasets, two of which are in our list, namely “Australian” and “credit-g (German Credit)”. For the Australian dataset, the accuracy of our new algorithm is the third best, after the algorithms “C4.5rules dis” and “C4.5dis”. For the German Credit dataset, the accuracy of our new algorithm (76.40) is above that of the best algorithm tested there (LDA, with a value of 74.6). Lessmann et al. [12] updated this research by testing various classification algorithms on 8 datasets.

The novelty of our algorithm lies in using both the Information Value and homogeneity metrics to weight the distance calculation.

3 Methods

3.1 Information Value

The Information Value (IV) metric is used to determine the predictive power of each feature. Its calculation is based on “Weight of Evidence” values [13]. The Weight of Evidence (WOE) of a feature value is defined as the natural logarithm of the ratio of the share of one class within that value to the share of the other class, as shown in Eq. 1.

$$ WoE\left( X_i = X_{ij} \right) = \ln \left( \frac{\text{share of responses of that category among all responses}}{\text{share of nonresponses of that category among all nonresponses}} \right) $$
(1)

Equation 2 gives the computation of the Information Value from the Weight of Evidence values. The differences between the percentage shares of the classes are used as weights in the summation.

$$ IV\left( X \right) = \sum_{j = 1}^{V} \left( \text{Distribution of class 1} - \text{Distribution of class 0} \right)_j \cdot WoE_j $$
(2)

The Information Value can be computed only for categorical variables. For that reason, we split continuous variables into 10 equal-sized bins before the calculation.

Both the Information Value and Weight of Evidence metrics are widely used in credit-scoring analysis in the banking sector.
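The WOE and IV computations of Eqs. 1 and 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation; the small `eps` constant guarding empty categories is our own assumption.

```python
import numpy as np

def woe_iv(feature, target, eps=1e-9):
    """Weight of Evidence per category (Eq. 1) and Information Value (Eq. 2)
    of a categorical feature. `target` holds 0/1 labels; class 1 plays the
    role of "responses" and class 0 of "nonresponses"."""
    feature = np.asarray(feature)
    target = np.asarray(target)
    n_pos = max(int((target == 1).sum()), 1)
    n_neg = max(int((target == 0).sum()), 1)
    woe, iv = {}, 0.0
    for cat in np.unique(feature):
        mask = feature == cat
        p = (target[mask] == 1).sum() / n_pos  # share of responses
        q = (target[mask] == 0).sum() / n_neg  # share of nonresponses
        w = np.log((p + eps) / (q + eps))      # eps guards empty cells
        woe[cat] = w
        iv += (p - q) * w                      # Eq. 2 summand
    return woe, iv
```

For a continuous variable one would first bin it (e.g. with `np.quantile` into 10 equal-sized bins, as described above) and then apply the same function to the bin labels.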

3.2 Homogeneity Metrics

For each feature, the following sparsity metrics are computed to represent the inverse of the homogeneity within each class:

  1. Mean absolute deviation

  2. Standard deviation

  3. Variance

  4. Coefficient of variation

All of them are used in the training phase to fit the classifier to the dataset. As the last step, the one yielding the highest accuracy is marked as the selected homogeneity metric. During tuning, in addition to these 4 choices, using no homogeneity metric at all is also evaluated and is selected when its accuracy is highest. Therefore, there are 5 alternatives for homogeneity.
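The four candidate metrics above can be computed per class and per feature as in the sketch below (our own illustration; the coefficient-of-variation guard for zero means is an assumption).

```python
import numpy as np

def sparsity_metrics(X, y):
    """Per-class, per-feature values of the four candidate sparsity
    (inverse-homogeneity) metrics. X: (n, d) array; y: class labels."""
    metrics = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mean = Xc.mean(axis=0)
        mad = np.abs(Xc - mean).mean(axis=0)   # mean absolute deviation
        std = Xc.std(axis=0)                   # standard deviation
        var = Xc.var(axis=0)                   # variance
        cv = np.divide(std, np.abs(mean),      # coefficient of variation,
                       out=np.zeros_like(std), # zero when the mean is zero
                       where=np.abs(mean) > 0)
        metrics[c] = {"mad": mad, "std": std, "var": var, "cv": cv}
    return metrics
```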

3.3 The Algorithm

In the training phase, first the centroid of each class, the Information Value of each feature and the homogeneity value of each feature within each class are computed. Then, for the given dataset, the best homogeneity metric is determined by running the scoring on the training dataset and comparing accuracies. Figure 1(a) depicts the flow of the training phase.

In the scoring phase, for each new instance, the Euclidean distance to each class centroid is calculated, where the weight of each feature is its Information Value divided by the selected sparsity metric. The class with the minimum weighted distance becomes the predicted class for that instance. Figure 1(b) depicts the flow of the scoring phase.

Fig. 1. Algorithm flowcharts

The algorithm is memory-based, but it does not require storing the training instances in memory; only class summaries (i.e., centroid vectors) need to be stored. We used only binary classification datasets, but the algorithm could also be applied to multiclass problems.
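The fit/predict flow described above can be condensed into a minimal sketch. This is our own illustration under simplifying assumptions: the per-feature IV vector is passed in precomputed, and the standard deviation is hard-coded as the selected sparsity metric (the tuning loop over the five alternatives is omitted).

```python
import numpy as np

class IVWeightedNearestCentroid:
    """Sketch of the proposed classifier: nearest centroid under a weighted
    Euclidean distance with weight_jc = IV_j / sparsity_jc per feature j
    and class c."""

    def __init__(self, iv, eps=1e-9):
        self.iv = np.asarray(iv, dtype=float)  # per-feature Information Values
        self.eps = eps                         # guards zero-spread features

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # standard deviation stands in for the tuned sparsity metric
        self.sparsity_ = np.array([X[y == c].std(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        w = self.iv / (self.sparsity_ + self.eps)  # (n_classes, d) weights
        preds = []
        for x in np.atleast_2d(X):
            # weighted squared Euclidean distance to each class centroid
            d = (w * (x - self.centroids_) ** 2).sum(axis=1)
            preds.append(self.classes_[np.argmin(d)])
        return np.array(preds)
```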

4 Experimentation and Results

4.1 Setup

All experiments are performed on a PC equipped with a 4-core 11th Gen Intel i7 CPU at 2.80 GHz and 32 GB RAM, running Microsoft Windows 11 Pro, Anaconda 3, Conda 22.9.0, Jupyter Notebook 6.4.12 and Python 3.9.13.

4.2 Datasets

We used the OpenML [14] machine learning repository, a public online platform built for scientific collaboration, and its Python API [15] to access datasets in the repository. We downloaded 45 binary classification datasets using this API. We also generated 5 synthetic classification datasets via the “make_classification” function of the Scikit-Learn library [16] and named them with the prefix “random”. Table 1 shows the datasets used in the experiments.

The second column (“data_id”) is the unique identifier in the OpenML repository. Table 1 also lists the number of instances and features in each dataset. Since pre-processing changes the number of features, the final feature counts are given in the last column of the table.

Table 1. List of datasets
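The synthetic datasets could be generated along the following lines; the exact `make_classification` parameters used for the “random” datasets are not stated in the paper, so the values below are illustrative assumptions.

```python
from sklearn.datasets import make_classification

# Illustrative generation of one synthetic binary classification dataset;
# n_samples, n_features and the informative/redundant split are assumptions.
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=10,
    n_redundant=5,
    n_classes=2,
    random_state=42,
)
```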

4.3 Pre-Processing

The following pre-processing steps are applied to all datasets:

  • Splitting: Each dataset is randomly divided into training and test datasets, with the share of the test dataset set to 25%.

  • Null Value Imputation: Null values were replaced by zero values.

  • Winsorization: For numeric features of the training dataset, 5% cut-off values are calculated from each end. Values beyond these limits are replaced by the cut-off values. Winsorization is applied only when the number of unique values of the feature exceeds 60.

  • Min-Max Scaling: All numeric feature values are proportionally scaled into [0,1] range.

  • One-hot encoding: For each distinct value of a categoric feature, a new flag variable is created.
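The numeric steps above (null imputation, winsorization, min-max scaling) can be sketched for a single feature as follows. This is our own illustration; in particular, learning the cut-offs and scaling bounds on the training split only is our reading of the procedure.

```python
import numpy as np

def preprocess_numeric(train_col, test_col, p=5.0, winsor_threshold=60):
    """Apply null imputation, winsorization and min-max scaling to one
    numeric feature; all statistics come from the training split."""
    train = np.nan_to_num(np.asarray(train_col, dtype=float))  # nulls -> 0
    test = np.nan_to_num(np.asarray(test_col, dtype=float))
    if len(np.unique(train)) > winsor_threshold:
        lo, hi = np.percentile(train, [p, 100 - p])  # 5% cut-off per end
        train = np.clip(train, lo, hi)
        test = np.clip(test, lo, hi)
    mn, mx = train.min(), train.max()
    scale = (mx - mn) or 1.0  # avoid division by zero for constant features
    return (train - mn) / scale, (test - mn) / scale
```

Note that, with training-derived bounds, scaled test values may fall outside [0, 1]; clipping them as well would be an alternative design choice.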

4.4 Tuning

For a fair comparison, the hyperparameters of the Logistic Regression, KNN and Decision Tree algorithms are tuned with the options shown in Table 2. For each dataset, tuning is performed with an exhaustive search over these parameters via the “GridSearchCV” function of the Scikit-learn library [16].

Table 2. Parameters used at the tuning
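The grid search can be sketched as below for KNN; the grid shown is an assumption for illustration, not the actual parameter set of Table 2.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Illustrative data and parameter grid (the real grids are those in Table 2).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
grid = {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]}

# Exhaustive search over the grid, scored by accuracy as in the paper.
search = GridSearchCV(KNeighborsClassifier(), grid, scoring="accuracy", cv=5)
search.fit(X, y)
best_knn = search.best_estimator_
```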

4.5 Results

Table 3 shows the accuracy values of the 5 algorithms on the test datasets. The proposed algorithm outperforms all four other algorithms on 5 datasets and ties for first place on another 3. It has a better accuracy score than Nearest Centroid on 43 datasets, and it is better than Logistic Regression, KNN and Decision Tree on 8, 17 and 15 datasets respectively.

We calculated the correlation coefficients between the accuracy values of our algorithm and those of the others. Our algorithm's accuracies over the test datasets correlate with those of Nearest Centroid, Logistic Regression, KNN and Decision Trees by 0.28, 0.25, 0.30 and 0.35 respectively.

Table 3. Accuracy of algorithms at test datasets

5 Discussion and Conclusion

Weight of Evidence and Information Value metrics are commonly used in banking to predict credit defaults, and our experiments include three datasets from the credit-risk domain, namely “Australian”, “credit-approval” and “credit-g (German Credit)”. On all three, the new algorithm was better than all others. This could be further investigated using additional datasets from the credit-risk domain.

The new algorithm is clearly superior to the Nearest Centroid algorithm and comparable to the other classic algorithms. It could be improved by considering other characteristics, such as class sizes.

Various alternatives may be considered to improve performance further. For example, instead of the Information Value, another measure of predictive power, such as variable importance, could be used. Similarly, in the distance calculation, the Euclidean formula could be replaced by other distance metrics, such as the Manhattan distance.

Over the 50 datasets, the new algorithm outperforms the benchmark algorithms on 5 datasets and ties for first place on another 3. The new algorithm is therefore comparable to well-known algorithms and worth enhancing further.

6 Declaration of Competing Interest

The authors declare that no conflict of interest was identified in this study.