1 Introduction

The most successful operational research technique in the financial sector is credit scoring [1]. One of the most crucial procedures in a bank is credit evaluation, also known as the credit management decision. This procedure involves the collection and analysis of data and the classification of different credit variables to arrive at a credit decision [2]. Empirical credit scoring has replaced judgmental credit scoring. Empirical credit scoring is essentially a method that produces a “score” that a bank can use to rank its loan applicants or borrowers in terms of risk. Empirical credit scoring models are built from historical data with statistical techniques; they attempt to isolate the effects of various applicant characteristics on delinquencies and defaults, for example, the difference between good and bad accounts [3]. Empirical credit scoring is more formal and accurate than judgmental credit scoring, and it has been improved based on the results of judgmental credit scoring. Different statistical and artificial intelligence (AI) techniques have been utilized to improve the performance of empirical credit scoring models [3, 4], such as discriminant analysis [5], linear probability, logit [6], and probit models [7]. These statistical techniques are suitable for identifying a linear relationship between independent and dependent variables. However, when these variables have nonlinear relationships, an AI technique such as a neural network is more appropriate than statistical techniques. Examples of AI techniques include Bayesian networks, support vector machines (SVM), integer programming [8], k-nearest neighbors (KNN) [9], and classification trees [10]. These techniques are commonly used to model and assess credit risks [11]. However, these methods possess weaknesses. For example, knowledge representation by an artificial neural network (ANN) is difficult because an ANN requires a large number of training samples and a long learning time [12].

Clustering is a widely recommended classification method. Cluster analysis is a non-parametric statistical method that can be used for credit scoring. One of its principal advantages is that it does not assume the presence of a specific data distribution. Thus, it is suitable when prior knowledge is insufficient. Cluster analysis is typically applied when no prior hypotheses exist. The method is exploratory and identifies the most likely solution, so it is suitable for credit risk analysis [13]. In clustering, similar data points or elements are grouped into the same cluster, whereas dissimilar data points are placed into different clusters. Clustering is performed through partitioning or hierarchical approaches. Partitioning clustering employs either the hard or the fuzzy clustering technique. Hard clustering assumes that each element of the dataset belongs to only one cluster, whereas fuzzy clustering assumes that each element (point) in a dataset can belong fully or partially to all or several clusters. Fuzzy clustering is more flexible than hard clustering and thus more suitable for real-world problems [14].

The Gustafson–Kessel (GK) algorithm is a powerful fuzzy clustering technique that has been used in various applications, such as image processing, data classification, and system identification. The GK algorithm is similar to the fuzzy c-means (FCM) algorithm; the main difference between them is the distance measure. The FCM algorithm uses the squared Euclidean distance, whereas the GK algorithm uses the Mahalanobis distance. The Mahalanobis distance employs a covariance matrix, so GK clusters can take ellipsoidal, hyper-ellipsoidal, or other forms. Clusters in the FCM algorithm are spherical, and their shape does not change according to the type of data.

In cluster analysis, one of the most challenging problems is determining the number of clusters in a dataset. The number of clusters is usually assumed to be known or fixed in advance, and this value is a crucial parameter for most clustering algorithms. However, a predetermined number of clusters is unrealistic for many real-world data analyses. Therefore, for algorithms such as the GK algorithm, a cluster validation technique needs to be used to determine the number of clusters [14].

Clusters in high-dimensional data cannot be visualized directly because such data have their own distinct characteristics; this makes the identification and definition of clusters within these data challenging [15].

Credit scoring databases are often large and contain many redundant and irrelevant features. Classifying these data is therefore computationally demanding. This difficulty can be overcome by using a feature selection method.

This study proposes a modified binary particle swarm optimization (MBPSO) algorithm to address the problems of feature subset selection and determination of the number of clusters in credit scoring data. Classification accuracy is improved to build a simple and robust credit scoring model. The most relevant features are selected through an evolutionary optimization-based approach inspired by binary particle swarm optimization (BPSO) and kernel fuzzy clustering. Although the proposed method is based on BPSO, it differs from other BPSO methods in the representation of particle positions (which encode both the feature subset and the number of clusters) and in the way the positions are updated (inclusion/exclusion of features and of cluster-number bits). The proposed approach is essentially a modification of the discrete PSO method [16].

The remainder of the paper is organized as follows. Section 2 provides a brief background and review of prior studies on feature subset selection and number of clusters. Section 3 describes the proposed MBPSO+GK. Section 4 presents the GK algorithm. Section 5 shows the measures of cluster validity. Section 6 shows the results of the proposed MBPSO+GK and a comparison of these results with those of two baseline algorithms, namely, BPSO+GK and GK. The conclusions and several recommendations for future research are presented in Section 7.

2 Background and review of prior studies

2.1 Feature selection

Several methods have been proposed to improve feature selection, including filter and wrapper methods. Filter methods use factor, principal component, discriminant, or independent component analysis to statistically test the variables and identify the best features. These methods can also use other distance and information measures for indirect performance measurement. Filter methods assess the key properties of the data rather than the properties of the classifier. Hence, these methods are quick and straightforward. However, filter methods are sensitive to redundancy [17].

By contrast, wrapper methods consider the accuracy of the classifier in selecting the best features. Consequently, the results depend on the type of classification algorithm used. Given that the resulting subset of features is closely linked to the classifier used, these methods are often not generalizable. Another drawback of wrapper approaches is that the search space for the selection of n features is \(2^{n}\); searching for features is therefore computationally expensive [18]. Regardless of the approach adopted for feature selection, the search strategy can significantly influence the results. Many wrapper techniques have been utilized for feature selection, but most of them tend to become stuck in local optima [19]. Most wrapper algorithms are either exact methods that apply the branch and bound principle or approximate methods that apply greedy sequential subset selection, mathematical programming, nested partitioning, or meta-heuristics [20]. Two examples of greedy methods are sequential forward and backward selection [21]. An example of mathematical programming is the method proposed in [22], which uses successive linearization and a bilinear algorithm to select the feature subset through a parametric objective function. Meanwhile, the use of the nested partition method was demonstrated in [23], and this approach was later extended to an adaptive version [24].

Another study proposed a stochastic gradient descent algorithm in which each feature is weighted according to its importance to classification to find the best subset of features [25]. A comparative study of four feature selection methods that use a data mining approach to reduce the feature space was also presented; the final results show that, among the four methods, the Gini index and information gain algorithms perform better [26]. Various meta-heuristic algorithms have also been proposed to improve feature selection, including genetic algorithms (GA), simulated annealing (SA), and particle swarm optimization (PSO). For instance, a GA was used in a previous study to optimize the feature subset and effectively model the parameters for SVM; this approach demonstrated good performance [27]. A GA-based neural network method was also proposed for feature selection [28]. In [29], PSO and a GA, both augmented with SVM, were used for the classification of high-dimensional microarray data. An SA approach, named SASVM by its authors, was developed in another study to obtain the parameters for feature selection in SVM [30]. Furthermore, a hybrid method that combines artificial bee colony optimization and a differential evolution algorithm was proposed in [31] to improve feature selection and enhance classification accuracy; the method was tested on 15 datasets from the UCI repository and was found to be successful [31]. This review of relevant literature indicates that a global search method is required to develop a successful feature selection algorithm. Evolutionary computation techniques are suitable because of their global search ability, and one of the best such techniques is the particle swarm algorithm [32]. In a PSO-based approach, each particle moves toward both its own best position and the best position found by the full swarm over several iterations. This concept is what guides PSO toward the optimal solution and is the basis upon which most variants of PSO are developed [33]. For instance, the PSOSVM model, a hybrid of discrete PSO and SVM, was proposed to choose an appropriate feature subset of simulated data [34]. A modified PSO algorithm was also proposed to select a feature subset [35]. In another work, four feature selection approaches for feature preprocessing were combined; one of them is a hybrid of fuzzy a priori and PSO [36]. Meanwhile, a weighted binary PSO method was proposed for feature selection, modeled as a discrete optimization task [37].

2.2 Number of clusters

In most clustering methods, the number of clusters is assumed to be known. Thus, enabling an algorithm to automatically estimate the number of clusters when it is not known in advance remains a major challenge. A previous study attempted to overcome this problem [38]: a dynamic clustering approach based on PSO was proposed, in which a binary PSO is used to find the best number of clusters and the centers of the selected clusters are then refined by K-means clustering. The authors applied their approach, called DCPSO, to an unsupervised image classification task with some success [38]. Another promising method is that proposed in [39], in which the classical PSO is modified to use a kernel-induced similarity measure instead of a sum-of-squares distance. The proposed approach, which the authors named multi-elitist PSO (MEPSO), can find the optimal number of clusters automatically [39]. A clustering method that can deal with different numbers of clusters has also been proposed; this method combines CPSO and K-means algorithms [40].

An improved version, called CPSOII, was subsequently proposed. CPSOII, a combination of PSO and a dynamic clustering algorithm, can automatically find the best number of clusters and categorize data objects [41]. A robust PSO-based clustering method that considers the local density of data to measure cluster compactness and that can automatically estimate the number of clusters was put forward in [42]; that method also deals well with noise.

Overall, this review indicates that most previous studies did not use BPSO to select the feature subset and estimate the number of clusters.

3 The proposed algorithm (MBPSO)

In BPSO, the position or state of each particle is a binary value that can change from 1 to 0 or from 0 to 1 [16]. Particle velocity is defined as the probability that the state changes from 0 to 1 or vice versa. To discover the optimal solution, each particle changes its direction during its search of the feature space according to its own best experience, i.e., cognitive learning (pbest), and the best experience of all the other particles, i.e., the swarm’s collective social learning (gbest).

Each particle retains pbest, its best fitness value so far, together with the position Pi at which that value was achieved. Each particle represents a candidate solution and is considered a point in a D-dimensional space, represented by its position and velocity.

In BPSO, the velocity of every particle is updated as follows

$$ v^{t}_{id}=v^{t-1}_{id}+c_{1}r_{1}\left( p^{t}_{id}-x^{t}_{id}\right)+c_{2}r_{2}\left( p^{t}_{gd}-x^{t}_{id}\right) $$
(1)

The changing positions of the particles are calculated by update functions. For instance, if \(s\left (v^{t+1}_{id}\right )\) is larger than a random number drawn uniformly from (0, 1), the position is set to 1 (i.e., the bit is selected for update). If \(s\left (v^{t+1}_{id}\right )\) is smaller than the random number, the position is set to 0 (i.e., the bit is not selected) [43].

$$ s\left( v^{t+1}_{id}\right)=\frac{1}{1+e^{-v^{t+1}_{id}}} $$
(2)

where s is the sigmoid function.

$$ \text{If}\ \text{rand} < s\left( v^{t+1}_{id}\right)\ \text{then}\ x_{id}=1\ \text{else}\ x_{id}=0 $$
(3)

for d = 1,2,…, D, where c1 is the cognitive learning factor and c2 is the social learning factor. Usually c1 = c2 = 2, and r1 and r2 are random numbers uniformly distributed in U(0,1). The velocity in each dimension is limited to \(v_{\max }\), which is a parameter set by the user of the algorithm. Each particle then moves to a new potential solution based on (3).
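
To make the standard BPSO update concrete, the following minimal Python sketch implements (1)–(3); the array shapes, parameter defaults, and the value of v_max are illustrative assumptions rather than settings from the original study.

```python
import numpy as np

def bpso_step(x, v, pbest, gbest, c1=2.0, c2=2.0, v_max=4.0):
    """One standard BPSO update: velocity via (1), bits via (2)-(3)."""
    n_particles, n_dims = x.shape
    r1 = np.random.rand(n_particles, n_dims)
    r2 = np.random.rand(n_particles, n_dims)

    # Eq. (1): velocity update from the cognitive (pbest) and social (gbest) terms
    v = v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    v = np.clip(v, -v_max, v_max)            # limit each velocity to [-v_max, v_max]

    # Eqs. (2)-(3): the sigmoid of the velocity decides whether each bit becomes 1
    s = 1.0 / (1.0 + np.exp(-v))
    x_new = (np.random.rand(n_particles, n_dims) < s).astype(int)
    return x_new, v
```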

Notably, the update function of BPSO does not consider the current position of a particle in the binary search space, so the choice of the next position is not influenced by the current position. The particle is thus represented by its velocity alone, even though a binary position already exists in BPSO. Therefore, we use both velocity and position, as in the original PSO, and change the update function of the original BPSO as follows:

$$ x^{t+1}_{id}=x^{t}_{id}+v^{t}_{id} $$
(4)
$$ \text{If}\ \text{rand} < \exp \left( x_{id}^{t+1}- x_{id}^{t}\right)\ \text{then}\ x_{id}=1\ \text{else}\ x_{id}=0 $$
(5)

where \( \exp \left ({x_{id}^{t+1}- x_{id}^{t}}\right )\) is the exponential of the change in the particle’s position over two successive steps. In our approach, the number of particles is set first, and each particle’s initial coding string is then produced randomly. In the feature selection process, each particle is coded to mimic a chromosome in a GA. In other words, each particle is coded into a binary string x = a1 a2...an f1 f2...fD. The first section (a1, a2,..., an) encodes the number of clusters: n is the maximum number of clusters, each ai is 0 or 1, and the number of 1s gives the number of clusters. The second section f1 f2...fD corresponds to the D features of the data; a bit value of 1 denotes a selected feature and a bit value of 0 denotes an unselected feature.

If \( \exp \left ({x_{id}^{t+1}- x_{id}^{t}}\right )\) is larger than a randomly produced number within (0, 1), the position value fm, m = 1, 2, …, D, is set to 1, which means that the feature is selected for the next update.
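
A minimal sketch of the modified update (4)–(5) and of the particle decoding described above follows; the function names and the safeguard that keeps at least two clusters are illustrative assumptions, not part of the original method.

```python
import numpy as np

def mbpso_step(x, v):
    """Modified position update of (4)-(5): the next bit depends on the
    current position as well as the velocity (a sketch, not the authors' code)."""
    x_cont = x + v                                   # Eq. (4)
    prob = np.exp(x_cont - x)                        # Eq. (5): exp of the position change
    return (np.random.rand(*x.shape) < prob).astype(int)

def decode_particle(bits, n_max_clusters):
    """Split a binary particle into (number of clusters, selected feature indices)."""
    cluster_bits = bits[:n_max_clusters]
    feature_bits = bits[n_max_clusters:]
    k = max(int(cluster_bits.sum()), 2)              # number of 1s = cluster count; >=2 is an added safeguard
    selected = np.flatnonzero(feature_bits)          # indices of the selected features
    return k, selected
```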

In the proposed MBPSO, the combination of PSO with a kernel-based fuzzy clustering algorithm allows the number of clusters to be determined and feature selection to be performed. Each particle’s fitness is evaluated by kernel fuzzy c-means (KFCM); a minimal sketch of this evaluation is given after the step list below. When the fitness of a particle is better than its personal best fitness, its position vector is saved as the personal best. When its fitness is better than the global best fitness, the position vector is saved as the global best. The particle velocities and positions are updated until the stopping criterion or criteria are met. Using PSO can lead to fast convergence during optimization, and precision can be improved by combining PSO with the KFCM algorithm.

The MBPSO consists of the following steps:

  1. Step 1:

    Initialize the particles in the swarm population to provide them random positions and velocity vectors.

  2. Step 2:

    Measure the fitness of each particle position by using the KFCM algorithm as follows:

    1. 1-

      Input the dataset X = (x1, x2,..., xn) and remove the unselected features from the data according to the particle’s position.

    2. 2-

      Identify the number of groups (clusters) from the positions of the initial particles and select the stopping criterion (the number of iterations or generations reaching a prespecified value).

    3. 3-

      Choose the initial centers of the clusters from the data randomly and calculate the partition matrix uij by:

      $$ u_{ij} = \frac{\left( \frac{1}{1-K(x_{i},c_{j})}\right)^{\frac{1}{m-1}}}{{\sum}_{r=1}^{k}\left( \frac{1}{1-K(x_{i},c_{r})}\right)^{\frac{1}{m-1}}} $$
      (6)

      where

      $$ K(x,c) =\phi (x)^{T} \phi (c) $$
      (7)

      which is an inner product kernel function. We adopt the Gaussian function as the kernel, i.e., \(K(x_{i},c_{j})=\exp \left (-\frac {\mid x_{i}-c_{j} \mid ^{2}}{\sigma ^{2}}\right )\).

      Since σ represents a dispersion parameter, the sample variance is used to estimate σ2 with \(\sigma ^{2}=\frac {{\sum }_{j=1}^{n}(x_{j}-\overline {x})^{2}}{n},\overline {x}=\frac {{\sum }_{j=1}^{n}x_{j}}{n}\).

    4. 4-

      Update the center matrix cj by the formula:

      $$ c_{j} = \frac{{\sum}_{i=1}^{n} u^{m}_{ij}K(x_{i},c_{j}) x_{i}}{{\sum}_{i=1}^{n}u^{m}_{ij}K(x_{i},c_{j})} $$
      (8)

      where i = 1,2,..., n, j = 1,2,..., k, and m = 2 is the fuzziness parameter.

    5. 5-

      Calculate the objective function of each partition; its value serves as the fitness of the particle, and the minimum value is sought.

      $$ J(X,U,C)={\sum}_{i=1}^{n}{\sum}_{j=1}^{k}(u_{ij})^{m} \mid \phi (x_{i})-\phi (c_{j}) \mid^{2} $$
      (9)

      where

      $$ \mid \phi (x_{i})-\phi (c_{j}) \mid^{2}=K(x_{i},x_{i})+K(c_{j},c_{j})-2K(x_{i},c_{j}) $$
      (10)
    6. 6-

      For convergence, test whether the termination tolerance satisfies \(\| C^{I}-C^{I+1}\| \leq \varepsilon\), where I is the iteration number.

    7. 7-

      Select the particle that has the minimum fitness function value as the local and global solution.

  3. Step 3:

    Calculate the velocity of each particle for the next update.

  4. Step 4:

    Move each particle to its next updated position according to (4) and (5), and return to Step 2 if the best position has not been found.

  5. Step 5:

    Stop the algorithm if the stopping criterion or criteria are satisfied or if the number of iterations reaches the predetermined maximum number.
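
The following sketch illustrates the KFCM fitness evaluation of Step 2 under the Gaussian kernel of (6)–(10); the function names, the numerical guard on 1 − K, and the multivariate reading of the sample-variance estimate of σ² are assumptions made for illustration only.

```python
import numpy as np

def gaussian_kernel(X, C, sigma2):
    """K(x, c) = exp(-||x - c||^2 / sigma^2) for every data point/center pair."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)   # (n, k)
    return np.exp(-d2 / sigma2)

def kfcm_fitness(X, k, m=2.0, n_iter=50, eps=1e-5, rng=None):
    """Evaluate a particle: run KFCM with k clusters on the selected features
    and return the final objective (9), which MBPSO seeks to minimize."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    # sigma^2 estimated from the sample variance (one reading for multivariate data)
    sigma2 = ((X - X.mean(axis=0)) ** 2).sum(axis=1).mean()
    C = X[rng.choice(n, size=k, replace=False)]               # random initial centers
    for _ in range(n_iter):
        K = gaussian_kernel(X, C, sigma2)                     # Eq. (7) with the Gaussian kernel
        # Eq. (6): memberships from the kernel-induced distances 1 - K
        w = (1.0 / (1.0 - K + 1e-12)) ** (1.0 / (m - 1.0))    # small guard added
        U = w / w.sum(axis=1, keepdims=True)
        # Eq. (8): kernel-weighted update of the cluster centers
        wK = (U ** m) * K                                     # (n, k)
        C_new = (wK.T @ X) / wK.sum(axis=0)[:, None]
        if np.linalg.norm(C_new - C) <= eps:                  # convergence test of sub-step 6
            C = C_new
            break
        C = C_new
    K = gaussian_kernel(X, C, sigma2)
    # Eqs. (9)-(10): ||phi(x) - phi(c)||^2 = 2 - 2K(x, c) for the Gaussian kernel
    return ((U ** m) * (2.0 - 2.0 * K)).sum()
```

In the MBPSO loop, the objective value returned here is compared against pbest and gbest to decide whether a particle’s position is retained.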

4 The Gustafson–Kessel algorithm

The GK algorithm is a powerful fuzzy clustering technique that can be used in many applications, such as image processing and classification. The algorithm estimates the cluster covariance matrix, which enables it to match the distance metric to the cluster shape, a key advantage [44]. The GK algorithm requires a set of n samples in the p-dimensional space and the number of clusters k as input parameters. A fuzzy partition of a dataset X can be represented by an (n × k) matrix U = [uij], where uij gives the degree to which the ith object belongs to the jth cluster, with (1 ≤ i ≤ n) and (1 ≤ j ≤ k).

The GK algorithm has several similarities with the FCM algorithm. The main difference is that the FCM algorithm uses the squared Euclidean distance measure, whereas the GK algorithm uses the Mahalanobis distance measure. The clusters formed by the FCM algorithm are spherical, and their shape does not change with the type of data. By contrast, in the GK algorithm, the cluster shape can adapt to the data and take several different forms, such as ellipsoidal and hyper-ellipsoidal, because the GK algorithm employs a covariance matrix. The GK algorithm consists of the following steps; a compact sketch of one GK iteration follows the steps.

  1. 1-

    Dataset X = (x1, x2,..., xn) is given.

  2. 2-

    Select the number of groups or clusters and the subset of features by the modified binary particle swarm optimization (MBPSO) algorithm, and select the termination condition.

  3. 3-

    Generate initial values for the partition matrix U = [uij], where uij denotes the degree of membership of xi in cluster j. This degree of membership satisfies the following constraints

    • 0 ≤ uij ≤ 1 for i ∈ 1,2,..., n, j ∈ 1,2,..., k

    • \({\sum }_{j=1}^{k}u_{ij}=1\) for i ∈ 1,2,..., n

  4. 4-

    Calculate the cluster center matrix C of dimension (k × p), whose rows cj (1 ≤ j ≤ k) represent the centers of the clusters, by the following formula

    $$ c_{j} = \frac{{\sum}_{i=1}^{n} u^{m}_{ij} x_{i}}{{\sum}_{i=1}^{n}u^{m}_{ij}} $$
    (11)
  5. 5-

    Compute the cluster covariance matrix by:

    $$ F_{j} = \frac{{\sum}_{i=1}^{n} u^{m}_{ij}(x_{i}-c_{j}) (x_{i}-c_{j})^{T}}{{\sum}_{i=1}^{n}u^{m}_{ij}} $$
    (12)
  6. 6-

    Compute the Mahalanobis distance by:

    $$ d^{2}_{ij} = (x_{i}-c_{j})A_{j} (x_{i}-c_{j})^{T} $$
    (13)

    where Aj is defined as follows:

    $$ A_{j}=V_{j}[\det(F_{j})]^{\frac{1}{p}} F_{j}^{-1} $$
    (14)

    where xi is the ith data point, p is the number of features or attributes, cj is the center of cluster j, Vj is the volume of cluster j (with Vj = 1 for all j), and \(F_{j}^{-1}\) is the inverse of matrix Fj.

  7. 7-

    Update the partition matrix uij by:

    $$ u_{ij} = \frac{1}{{\sum}_{r=1}^{k}\left( \frac{d_{ij}}{d_{ir}}\right)^{\frac {2}{m-1}}} $$
    (15)

    where i = 1,2,..., n and j = 1,2,..., k.

  8. 8-

    For convergence, test whether the termination tolerance satisfies \(\| U^{I}-U^{I+1}\| \leq \varepsilon\), where I is the iteration number.
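
For reference, a compact sketch of one GK iteration implementing (11)–(15) with unit cluster volumes (Vj = 1) is given below; the function name, the numerical guard against zero distances, and the vectorized form are illustrative assumptions.

```python
import numpy as np

def gk_iteration(X, U, m=2.0):
    """One Gustafson-Kessel iteration: centers (11), fuzzy covariances (12),
    Mahalanobis distances (13)-(14), and the membership update (15)."""
    n, p = X.shape
    k = U.shape[1]
    Um = U ** m

    # Eq. (11): fuzzy cluster centers (k x p)
    C = (Um.T @ X) / Um.sum(axis=0)[:, None]

    D2 = np.empty((n, k))
    for j in range(k):
        diff = X - C[j]                                        # (n, p)
        # Eq. (12): fuzzy covariance matrix of cluster j
        F = (Um[:, j, None] * diff).T @ diff / Um[:, j].sum()
        # Eq. (14): norm-inducing matrix A_j with unit cluster volume V_j = 1
        A = np.linalg.det(F) ** (1.0 / p) * np.linalg.inv(F)
        # Eq. (13): squared Mahalanobis distance of every point to center j
        D2[:, j] = np.einsum('ip,pq,iq->i', diff, A, diff)

    D2 = np.maximum(D2, 1e-12)                                 # guard against zero distances
    # Eq. (15): membership update from the ratios of distances
    ratio = (D2[:, :, None] / D2[:, None, :]) ** (1.0 / (m - 1.0))
    U_new = 1.0 / ratio.sum(axis=2)
    return U_new, C
```

In practice, U is initialized randomly with rows summing to 1, and the iteration is repeated until \(\| U^{I}-U^{I+1}\| \leq \varepsilon\).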

5 Measures of cluster validity

5.1 Internal measures

Several internal indices are used; the most important ones are described below, and a short sketch of how some of them are computed follows the list.

  1. 1-

    The partition coefficient (PC) measures the amount of overlapping between clusters and is defined as follows [45]:

    $$ \text{PC} = \frac{1}{n}\sum\limits_{i=1}^{n} \sum\limits_{j=1}^{k}u_{ij}^{2} $$
    (16)

    where uij is the membership degree of the ith data point in the jth cluster. The best algorithm for partitioning the data is the one that produces the highest value of PC.

  2. 2-

    Classification entropy (CE) measures only the fuzziness of the cluster partitions, so it is similar to PC [45].

    $$ \text{CE} = -\frac{1}{n}\sum\limits_{i=1}^{n} \sum\limits_{j=1}^{k}u_{ij} \log(u_{ij}) $$
    (17)

    The best clustering algorithm is the one with the lowest value of CE.

  3. 3-

    The partition index (SC) [46] is the ratio of the sum of compactness and separation of the clusters. It is the sum of the individual cluster validity measures, each normalized by the fuzzy cardinality of the cluster.

    $$ \text{SC} =\sum\limits_{j=1}^{k} \frac{{\sum}_{i=1}^{n} u_{ij}^{m}|| x_{i}-c_{j}||^{2}}{{\sum}_{i=1}^{n} u_{ij} {\sum}_{d=1}^{k}|| c_{d}-c_{j}||^{2}} $$
    (18)

    SC is useful for comparing different partitions with an equal number of clusters. A good partition is obtained by a low value of SC.

  4. 4-

    The separation index (S) [46], in contrast to SC, uses a minimum-distance separation for partition validity. A lower value of S indicates a good partition.

    $$ S =\sum\limits_{j=1}^{k} \frac{\sum\limits_{i=1}^{n} u_{ij}^{2}|| x_{i}-c_{j}||^{2}}{n \min_{i\neq j}|| c_{i}-c_{j}||^{2}} $$
    (19)
  5. 5-

    The Xie–Beni (XB) index [47] measures the ratio between the total variation within clusters and the separation of the clusters. It is defined as follows:

    $$ XB =\sum\limits_{j=1}^{k} \frac{{\sum}_{i=1}^{n} u_{ij}^{m}|| x_{i}-c_{j}||^{2}}{n \min_{ij}|| x_{i}-c_{j}||^{2}} $$
    (20)

    This index focuses on separation and compactness properties. The clusters are well separated if XB has a small value.

  6. 6-

    The Dunn’s index (DI) [47] aims to identify dense and well-separated clusters. It is defined as the ratio between the minimal inter-cluster distance and the maximal intra-cluster distance. For each cluster partition, this index is computed as follows:

    $$ DI =\min_{i\in k}\left\lbrace \min_{j\in k, j\neq i}\left\lbrace\frac{\min_{x\in c_{i}, y\in c_{j}}d(x,y)}{\max_{l\in k}\max_{x,y\in c_{l}}d(x,y)} \right\rbrace \right\rbrace $$
    (21)

    A high Dunn’s index indicates that the algorithm produces compact and well-separated clusters.

  7. 7-

    Davies-Bouldin index (DB) [48] is defined as follows:

    $$ \text{DB}=\frac{1}{k}\sum\limits_{i=1}^{k} \max_{j\neq i}\left( \frac{d_{i}+d_{j}}{d(c_{i},c_{j})} \right) $$
    (22)

    where k is the number of clusters, ci and cj are the centers of clusters i and j, di and dj are the average distances of all the elements in clusters i and j to their respective centers, and d(ci, cj) is the distance between the centers ci and cj. The best algorithm is the clustering algorithm that produces a collection of clusters with the smallest DB index.
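
As an illustration, the following sketch computes three of the above indices (PC, CE, and XB, as defined in (16), (17), and (20)) from a membership matrix U, the data X, and the cluster centers C; the function names are hypothetical.

```python
import numpy as np

def partition_coefficient(U):
    """Eq. (16): PC = (1/n) * sum_i sum_j u_ij^2; higher is better."""
    return float((U ** 2).sum() / U.shape[0])

def classification_entropy(U, eps=1e-12):
    """Eq. (17): CE = -(1/n) * sum_i sum_j u_ij * log(u_ij); lower is better."""
    return float(-(U * np.log(U + eps)).sum() / U.shape[0])

def xie_beni(X, U, C, m=2.0):
    """Eq. (20) as written in the text: fuzzy within-cluster variation divided
    by n times the minimum point-to-center distance; lower is better."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)   # ||x_i - c_j||^2
    return float(((U ** m) * d2).sum() / (X.shape[0] * d2.min()))
```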

5.2 External measures

External measures indicate the quality of the resulting partitioning; thus, they can be considered tools that can help experts evaluate the clustering results. The fuzzy Rand index is a well-known measure of similarity between two partitions of a dataset [49].

Given a fuzzy partition W = {W1, W2,…, Wk} of X, each element x ∈ X can be characterized by its membership vector

$$ W(x) = (W_{1}(x), W_{2}(x), \ldots, W_{k}(x))\in\left[ 0,1\right]^{k} $$
(23)

where Wi(x) is the degree of membership of x in the i th cluster Wi. A similarity measure for associated membership vectors can be formed as follows:

$$ E_{W}(x, x^{\prime}) = 1 - ||W(x) - W(x^{\prime})|| $$
(24)

where ||.|| is a proper metric on \(\left [0, 1\right ]^{k}\). If W and Z are two fuzzy partitions, the concept of concordance can be generalized as follows: for a pair \((x, x^{\prime })\), the degree of concordance is

$$ \text{conc}(x, x^{\prime}) = 1 -||E_{W}(x, x^{\prime}) - E_{Z}(x, x^{\prime})|| \in\left[0, 1\right] $$
(25)

the degree of discordance is

$$ \text{disc}(x, x^{\prime}) = ||E_{W}(x, x^{\prime}) - E_{Z}(x, x^{\prime})|| $$
(26)

The distance measure for the fuzzy partitions is then defined by the normalized sum of the degrees of discordance as follows:

$$ d(W,Z)= \frac{{\sum}_{(x,x^{\prime})\in X}||E_{W}(x, x^{\prime}) - E_{Z}(x, x^{\prime})||}{(N(N-1)/2)} $$
(27)

Likewise,

$$ \text{RE}(W,Z) = 1 - d(W,Z) $$
(28)

This quantity corresponds to the normalized degree of concordance and is a direct generalization of the original Rand index. The Rand index is a similarity measure that takes values between 0 and 1. A value near 1 means that the ith cluster in W and the ith cluster in Z are nearly identical; when RE(W,Z) = 1, W = Z.
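
A minimal sketch of the fuzzy Rand index RE(W, Z) of (23)–(28) is shown below; the choice of half the L1 distance as the metric on [0, 1]^k is an assumption, since the text only requires a proper metric.

```python
import numpy as np

def fuzzy_rand_index(W, Z):
    """Fuzzy Rand index RE(W, Z) of Eqs. (23)-(28) for two membership matrices
    (rows = objects, columns = clusters). Half the L1 distance is used as the
    metric so that E_W stays in [0, 1] (an assumption)."""
    n = W.shape[0]
    disc_sum = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            e_w = 1.0 - 0.5 * np.abs(W[i] - W[j]).sum()   # Eq. (24) under partition W
            e_z = 1.0 - 0.5 * np.abs(Z[i] - Z[j]).sum()   # Eq. (24) under partition Z
            disc_sum += abs(e_w - e_z)                    # Eq. (26): degree of discordance
    d = disc_sum / (n * (n - 1) / 2.0)                    # Eq. (27): normalized distance
    return 1.0 - d                                        # Eq. (28): RE(W, Z)
```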

6 Results and discussion

6.1 Data description

To evaluate the performance of the proposed approach, Australian, German, and Taiwanese credit datasets from the UCI machine learning repository were used. Table 1 shows the characteristics of the datasets. The input variables were scaled during the data preprocessing stage. The main advantage of scaling is that it prevents attributes with large numerical ranges from dominating those with small numerical ranges. Another advantage is that it prevents numerical difficulties during calculation. According to our experimental results, scaling the feature values also helps increase accuracy. Each feature is linearly scaled to the [0, 1] range by using the following formula

$$ x^{1}=\frac{x-\min_{x}}{\max_{x}-\min_{x}} $$
(29)

where x is the original value, x1 is the scaled value, \(\max_{x}\) is the maximum value of feature x, and \(\min_{x}\) is the minimum value of feature x.
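
A one-function sketch of the scaling in (29), applied column-wise to a data matrix, is given below; the guard for constant-valued features is an added assumption.

```python
import numpy as np

def min_max_scale(X):
    """Eq. (29): linearly scale every feature column of X to the [0, 1] range."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)   # guard against constant features
    return (X - x_min) / span
```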

Table 1 Dataset description

The following tables show the internal indices for the clustering of the Australian, German, and Taiwanese credit data. The first column shows the index values of the GK algorithm, and the second column presents the baseline GK+BPSO algorithm in three cases: feature selection only, determination of the number of clusters only, and both feature selection and determination of the number of clusters. The third column shows the proposed GK+MBPSO for the same three cases.

Table 2 shows the values of the internal indices for the Australian data for GK, GK+BPSO, and GK+MBPSO. The value of the first index (PC) for the GK+MBPSO algorithm is near 1 and greater than its values for GK and GK+BPSO. The value of the second index (CE) for the GK+MBPSO algorithm is less than its values for GK and GK+BPSO. The values of the other indices (SC, S, and XB) for the GK+MBPSO algorithm are also lower than those for the GK and GK+BPSO algorithms. Moreover, the value of DI for the proposed method is greater than the values for the other methods, and the value of DB for the GK+MBPSO algorithm is lower than the values for the GK and GK+BPSO algorithms. These results mean that the proposed algorithm separates the clusters well. The proposed method determined that the number of clusters is five, and it selected eight features.

Table 2 The validity measures of Australian credit data

Table 3 shows the internal indices for the German data for GK, GK+BPSO, and GK+MBPSO. The value of the first index (PC) for the GK+MBPSO algorithm is near 1 and greater than its values for GK and GK+BPSO. The value of the second index (CE) for the GK+MBPSO algorithm is less than its values for GK and GK+BPSO. The values of the other indices (SC, S, and XB) for the GK+MBPSO algorithm are also lower than those for the GK and GK+BPSO algorithms. Moreover, the value of DI for the proposed method is greater than the values for the other methods, and the DB for GK+MBPSO is lower than the values for GK and GK+BPSO. These results mean that the proposed algorithm separates the clusters well. The proposed method determined that the number of clusters is four, and it selected 10 features.

Table 3 The validity measures of German credit data

Table 4 shows the internal indices for the Taiwan data for GK, GK+BPSO, and GK+MBPSO. The value of the first index (PC) for the GK+MBPSO algorithm is near 1 and greater than its values for GK and GK+BPSO. The value of the second index (CE) for the GK+MBPSO algorithm is less than its values for GK and GK+BPSO. The values of the other indices (SC, S, and XB) for the GK+MBPSO algorithm are also lower than those for the GK and GK+BPSO algorithms. Moreover, the value of DI for the proposed method is greater than the values for the other methods, and the DB for the GK+MBPSO algorithm is lower than that for the GK and GK+BPSO algorithms. These results mean that the proposed algorithm separates the clusters well. The proposed method determined that the number of clusters is five, and it selected 10 features.

Table 4 The validity measures of Taiwan credit data

As shown in the summarized results, the proposed modified method (MBPSO) for determining the number of clusters and for feature selection with the GK algorithm (GK+MBPSO) exhibits the best performance for the three datasets because it achieves a smaller distance (objective) function value and requires fewer iterations. A t test was conducted on the internal index values of the proposed method (GK+MBPSO) and GK for the three datasets. The results demonstrate that significant differences exist between them at the 95% confidence level; the p values of the comparison between the proposed method (MBPSO+GK) and GK for the three datasets are 0.025, 0.033, and 0.043, respectively. Compared with the method of [31], which requires 300 iterations repeated 10 times, our method does not exceed 40 iterations for the three datasets. Moreover, the datasets used in [31] contain only 30–50 samples, whereas our datasets contain 690, 1,000, and 30,000 samples.

Table 5 shows that the fuzzy Rand validity measures of the GK+MBPSO algorithm for the Australian, German, and Taiwanese credit datasets are 0.9911, 0.9955, and 0.9933, respectively; these values are greater than those of the two other methods. This finding means that the fuzzy partition (classification) is robust, so the risk associated with loans can be reduced with this method.

Table 5 The fuzzy Rand validity measures of the three credit datasets

7 Conclusion

We proposed a new modified BPSO-KFCM method for determining the number of clusters and for selecting features in fuzzy data clustering. We developed and improved the GK algorithm to increase the classification accuracy of cluster analysis. The three algorithms were applied to Australian, German, and Taiwanese credit datasets, and their performance was compared. The cluster internal validity indices of the proposed method (GK+MBPSO) are better than those of the other algorithms. The t test on the internal indices of the proposed method (GK+MBPSO) demonstrated that significant differences exist among the methods at the 95% confidence level. The results of the fuzzy Rand validity measures show that the fuzzy partition (classification) is robust, so the risk associated with loans can be reduced with this method. In future work, other validation measures can be utilized to test the effectiveness of the proposed approach for cluster analysis. Moreover, the modified BPSO-KFCM can be improved to select the initial cluster centers together with feature selection.

The cluster internal validity indices confirm that the performance of the proposed algorithm (GK+MBPSO) is better than that of the GK and GK+BPSO algorithms. A fuzzy validity index is applied in this paper to evaluate the fitness of the clustering to the datasets.