
1 Introduction

Credit approval is one of the most critical decisions in banking, requiring solid risk analysis. Credit scoring systems were introduced to evaluate customers’ eligibility for credit approval based on historical and current information about the customers. This information can be numeric, such as income, age, and volume of previous credit history, as well as nominal-categorical, such as sex, race, type of criminal record, and so on. Although simple credit scoring systems can process such nominal-categorical variables easily, they are difficult to handle in more sophisticated statistical methods for credit approval decision making.

Traditional techniques such as discriminant analysis and logistic regression suffer in the presence of nominal-categorical data. When the variables are nominal (categorical), defining similarity (dissimilarity) measures becomes difficult and requires a new metric. In this paper, our objective is to introduce a new approach for supervised classification using hybrid radial basis function neural networks (HRBF-NN) with a continuity justification on the dependent variable, so as to handle a mixture of nominal-categorical and continuous predictors without using dummy variables for classification. We illustrate the practical utility and importance of our approach with a real example on a benchmark credit approval data set from the banking industry, classifying good and bad customers. Most of the technical details of this paper can be found in Akbilgic et al. (2013), Akbilgic (2011), and Akbilgic and Bozdogan (2011). Here, we only recapitulate the parts of these papers needed to set up the background of the current paper.

The paper is organized as follows. In Section 2, we briefly explain the HRBF-NN model and what a radial basis function neural network (RBF-NN) model is. In Section 3, we discuss classification trees (CT) and their use in the HRBF-NN model, namely transforming tree nodes into RBFs. Estimation of the weight parameters using the least-squares method is presented in Section 4. We then explain how to make the classification problem look like nonparametric regression by adding a threshold function to the output neuron of the RBF-NN model. Our threshold function turns out to be a non-linear function of the predictive model. For other threshold selection methods, we refer the reader to Flach et al. (2013) in this volume. In Section 5, for model selection, we develop and use the information-theoretic measure of complexity (ICOMP) criterion as our fitness function and show its derived form under both correctly specified and misspecified HRBF-NN models. We also give the forms of Akaike’s information criterion (AIC) (Akaike 1973; Bozdogan 1987) and Rissanen/Schwarz (MDL/SBC) (Rissanen 1978; Schwarz 1978). In Section 6, we briefly explain the background of the genetic algorithm (GA) and its implementation for selecting the best subset of predictors which discriminate between the classes. In Section 7, we give a numerical example to illustrate the performance of the proposed supervised classification approach via the HRBF-NN model on a real credit approval data set, classifying customers into good/bad credit card classes. Finally, in Section 8, we conclude the paper with a discussion.

2 Hybrid Radial Basis Function Neural Networks: HRBF-NN Model

In this section, we briefly introduce the structure of the HRBF-NN model as a combination of RBF-NNs, classification trees, ridge regression, the information complexity (ICOMP) criterion, and the genetic algorithm (GA).

2.1 RBF-NN Model

The RBF-NN model is a technique that transforms non-linearly separable features into linearly separable features using radial basis functions (RBFs). The RBF-NN model is a nonparametric regression technique (Bishop 1995) defined as

$$\displaystyle{ y = f(w,x) =\sum _{j=1}^{m}w_{j}h_{j}(x) = w_{1}h_{1} + w_{2}h_{2} + \cdots + w_{m}h_{m}. }$$
(1)

In equation (1), y is the dependent variable, \(x_{1},x_{2},\ldots,x_{m}\) are the independent variables, \(\left \{h_{j}(x)\right \}_{j=1}^{m}\) are the radial basis functions, and \(\left \{w_{j}\right \}_{j=1}^{m}\) are the unknown adaptable coefficients, or weights. Equation (1) is represented in matrix form in equation (2), where H is the (n × m) design matrix and \(\varepsilon\) is an (n × 1) vector of random noise terms, such that

$$\displaystyle{ y = Hw +\varepsilon. }$$
(2)

2.2 Radial Basis Functions

The RBF-NN gains its flexibility from the RBFs. We shall consider the four most common RBFs in this work, although there are many others. These are the Gaussian (GS), Cauchy (CH), multi-quadratic (MQ), and inverse multi-quadratic (IMQ) functions given in Table 1.

Table 1 The most common radial basis functions

The RBF-NN non-linearly transforms n-dimensional inputs into an m-dimensional space by m basis functions, each characterized by its center \(c_{j}\) in the (original) input space and a width or radius vector \(r_{j}\), \(j \in \left \{1,2,\ldots,m\right \}\) (Orr 2000).
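To make this construction concrete, the following NumPy sketch (not the authors' code) implements the four RBFs in commonly used parameterizations, written in terms of the scaled distance to a center; the exact forms in Table 1 may differ slightly. It also builds the design matrix H of equation (2). The centers, radii, and data here are placeholder values.

```python
import numpy as np

# Common parameterizations of the four RBFs, as functions of z = ||x - c|| / r.
# (The exact forms used in Table 1 may differ; these are the standard ones.)
def gaussian(z):            return np.exp(-z**2)
def cauchy(z):              return 1.0 / (1.0 + z**2)
def multi_quadratic(z):     return np.sqrt(1.0 + z**2)
def inv_multi_quadratic(z): return 1.0 / np.sqrt(1.0 + z**2)

def design_matrix(X, centers, radii, rbf=gaussian):
    """Build the (n x m) design matrix H of Eq. (2), with H[i, j] = h_j(x_i)."""
    # scaled distance of every observation to every RBF center
    Z = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) / radii[None, :]
    return rbf(Z)

# toy usage with placeholder centers and radii
rng = np.random.default_rng(0)
X = rng.random((100, 3))                                # 100 observations, 3 inputs
centers = X[rng.choice(100, size=5, replace=False)]     # m = 5 RBF centers
radii = np.full(5, 0.5)                                 # one scalar radius per center
H = design_matrix(X, centers, radii, rbf=cauchy)
print(H.shape)                                          # (100, 5)
```

In the HRBF-NN, the centers and radii are not chosen at random as above but are obtained from the classification tree, as described in Section 3.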

3 Classification Trees and its Use in HRBF-NN Model

3.1 Classification Trees

Classification and regression tree (CART) models are used for both prediction and classification. The classification tree algorithm is based on recursively partitioning the input space, each split dividing a region into two axis-parallel hyper-rectangles. The hyper-rectangles with no further splits are called the terminal nodes of the tree, and a class label is assigned to each terminal node. The class assignment rule for a terminal node is simply to assign the class label having the largest number of members in that node (Sutton 2005).

During the recursive partitioning of the input space, each split is parallel to one of the axes and can be expressed as an inequality involving one of the input components (e.g., x k  > b). The input space is divided into hyper-rectangles organized into a binary tree, where each branch is determined by the dimension \(\left (k\right )\) and boundary \(\left (b\right )\) which together minimize the misclassification error (Orr 2000). The root node of the classification tree is the smallest hyper-rectangle that includes all of the training data \(\left \{x_{i}\right \}_{i=1}^{p}\). Its size \(s_{k}\) (half-width) and center \(c_{k}\) in each dimension k are

$$\displaystyle\begin{array}{rcl} s_{k} = \frac{1}{2}\left(\mathop{\mathrm{max}}\limits_{i \in S}(x_{\mathit{ik}}) -\mathop{\mathrm{min}}\limits_{i \in S}(x_{\mathit{ik}})\right)& &{}\end{array}$$
(3)
$$\displaystyle\begin{array}{rcl} c_{k} = \frac{1}{2}\left(\mathop{\mathrm{max}}\limits_{i \in S}(x_{\mathit{ik}}) +\mathop{\mathrm{min}}\limits_{i \in S}(x_{\mathit{ik}})\right)& &{}\end{array}$$
(4)

where \(k \in K\), the set of predictor indices, and \(S = \left \{1,2,\ldots,p\right \}\) is the set of training-set indices. A split of the root node divides the training samples into left and right subsets, S L and S R , on either side of a boundary b in one of the dimensions k, such that

$$\displaystyle\begin{array}{rcl} S_{L} = \left \{i: x_{\mathit{ik}} \leq b\right \},& &{}\end{array}$$
(5)
$$\displaystyle\begin{array}{rcl} S_{R} = \left \{i: x_{\mathit{ik}} > b\right \}.& &{}\end{array}$$
(6)

In classification trees, for a given set of class labels \(\left \{A_{1},A_{2},A_{3},\ldots \right \}\), the output values on each side of the bifurcation are

$$\displaystyle\begin{array}{rcl} \hat{y}_{L} = A_{\mathrm{argmax}_{i \in S_{L}}\{a_{i}\}}& &{}\end{array}$$
(7)
$$\displaystyle\begin{array}{rcl} \hat{y}_{R} = A_{\mathrm{argmax}_{i \in S_{R}}\{a_{i}\}}& &{}\end{array}$$
(8)

where the number of members of each class label in the subset is given by the set \(a = \left \{a_{1},a_{2},a_{3},\ldots \right \}\). The misclassification error (MCE) rate is then

$$\displaystyle{ \mathrm{MCE}(k,b) = \frac{\sum _{i \in S_{L}}M(y_{i},\hat{y}_{L}) +\sum _{i \in S_{R}}M(y_{i},\hat{y}_{R})} {n}, }$$
(9)

where n is the total sample size, and \(M(y_{i},\hat{y})\) is a function equal to 0 if \(y_{i} =\hat{ y}\), and 1 otherwise.

The split which minimizes \(\mathrm{MCE}\left (k,b\right )\) over all possible choices of k and b is used to create the children of the root node and is found by a simple discrete search over the m dimensions and p observations. The children of the root node are split recursively in the same manner, and the process terminates when every remaining split would create children containing fewer than \(p_{\min }\) samples, which is another parameter of the method. The children are shifted with respect to their parent nodes and their sizes reduced in the k-th dimension (Akbilgic et al. 2013; Akbilgic 2011; Akbilgic and Bozdogan 2011).
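As an illustration of this split search, the following Python sketch exhaustively evaluates candidate splits (k, b) and returns the one minimizing the MCE of equation (9). It assumes integer-coded class labels and midpoints between consecutive sorted values as candidate boundaries; it is a sketch, not the CART implementation used by the authors.

```python
import numpy as np

def best_split(X, y):
    """Exhaustive search for the split (k, b) minimizing the MCE of Eq. (9).

    X is a (p x K) array of inputs and y holds integer-coded class labels
    (0, 1, ...).  Candidate boundaries are taken midway between consecutive
    sorted values in each dimension.
    """
    p, K = X.shape
    best_k, best_b, best_mce = None, None, np.inf
    for k in range(K):
        values = np.unique(X[:, k])
        for b in (values[:-1] + values[1:]) / 2.0:
            left, right = y[X[:, k] <= b], y[X[:, k] > b]
            y_L = np.bincount(left).argmax()      # majority label, Eq. (7)
            y_R = np.bincount(right).argmax()     # majority label, Eq. (8)
            mce = (np.sum(left != y_L) + np.sum(right != y_R)) / p
            if mce < best_mce:
                best_k, best_b, best_mce = k, b, mce
    return best_k, best_b, best_mce
```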

3.2 Transforming Tree Nodes into RBFs

A classification tree contains a root node, some non-terminal nodes (having children), and some terminal nodes (having no children). Each node is associated with a hyper-rectangle of the input space having a center c and size s, as described above. The node corresponding to the largest hyper-rectangle is the root node, and it is divided into smaller and smaller pieces progressing down the tree (Breiman et al. 1984; Orr 2000). To transform a hyper-rectangle into an RBF of a given basis kernel, we use its center c as the RBF center and its size s, scaled by a parameter α, as the RBF radius given by

$$\displaystyle{ r =\alpha s. }$$
(10)

The scalar α has the same value for all nodes (Kubat 1998) and is another parameter of the method. In this study we set \(\alpha = \sqrt{2}\alpha _{K}^{-1}\), where α K is Kubat’s parameter (Kubat 1998; Orr 2000).
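A minimal sketch of this node-to-RBF conversion is given below, assuming each node is summarized by the training inputs falling inside it; the function name and the use of one radius per input dimension are illustrative assumptions, not the authors' code.

```python
import numpy as np

def node_to_rbf(node_samples, alpha_K=1.0):
    """Turn one tree node's hyper-rectangle into an RBF center and radius.

    node_samples : (n_node x K) array of inputs falling in the node.
    Returns (center c, radius r) following Eqs. (3), (4) and (10) with
    alpha = sqrt(2) / alpha_K, where alpha_K is Kubat's parameter.
    """
    hi, lo = node_samples.max(axis=0), node_samples.min(axis=0)
    c = 0.5 * (hi + lo)                   # Eq. (4): center in each dimension
    s = 0.5 * (hi - lo)                   # Eq. (3): half-width in each dimension
    r = (np.sqrt(2.0) / alpha_K) * s      # Eq. (10): scaled radius
    return c, r
```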

4 Estimation of Weight Parameters

4.1 Least-Squares Estimation

Given a network model as in equation (1) consisting of m RBFs with centers \(\left \{c_{j}\right \}_{j=1}^{m}\) and radii \(\left \{r_{j}\right \}_{j=1}^{m}\), and a training set with p patterns \(\left \{\left (x_{i},y_{i}\right )\right \}_{i=1}^{p}\), the optimal network weights can be found by minimizing the sum of squared errors

$$\displaystyle{ \mathrm{SSE} =\sum _{ i=1}^{p}\left (f(x_{ i}) - y_{i}\right )^{2} }$$
(11)

and is given by

$$\displaystyle{ \hat{w} = \left (H^{{\prime}}H\right )^{-1}H^{{\prime}}y }$$
(12)

the so-called least-squares estimate. Here H is the design or model matrix, with elements \(H_{\mathit{ij}} = h_{j}(x_{i})\), and \(y = \left (y_{1},y_{2},\ldots,y_{p}\right )^{{\prime}}\) is the p-dimensional vector of training-set output values.

In RBF-NN, one of the most common problems is singularity of the \(\left(H^{{\prime}}H\right)\) matrix. To overcome this possible singularity problem in the model matrix, we use global ridge regression (Tikhonov and Arsenin 1977; Bishop 1991) to regularize the HRBF-NN model, with the cost function given by

$$\displaystyle{ C(w,\lambda ) =\sum _{i=1}^{p}\left (f(x_{i}) - y_{i}\right )^{2} +\lambda \sum _{j=1}^{m}w_{j}^{2} =\varepsilon ^{{\prime}}\varepsilon +\lambda w^{{\prime}}w. }$$
(13)

C(w, λ) is minimized to find a weight vector which is more robust to noise in the training set. The optimal weight vector for global ridge regression is given in equation (14), where \(I_{m}\) is the m-dimensional identity matrix and λ is the regularization parameter.

$$\displaystyle{ \hat{w} = \left (H^{{\prime}}H +\lambda I_{ m}\right )^{-1}H^{{\prime}}y. }$$
(14)

We use the Hoerl, Kennard, and Baldwin (HKB) approach (Hoerl et al. 1975) to adaptively determine the optimal λ from the data, given by

$$\displaystyle{ \hat{\lambda }_{\mathrm{HKB}} = \frac{\mathit{ms}^{2}} {\hat{\mathbf{w}}_{\mathit{LS}}^{{\prime}}\hat{\mathbf{w}}_{\mathit{LS}}}, }$$
(15)

where m = k, the number of predictors not including the intercept term, n is the number of observations, and s 2 is the estimated error variance using the k predictors, so that

$$\displaystyle{ s^{2} = \frac{1} {\left (n - k + 1\right )}\left (y - H\hat{\mathbf{w}}_{\mathit{LS}}\right )^{{\prime}}\left (y - H\hat{\mathbf{w}}_{\mathit{ LS}}\right ), }$$
(16)

where \(\hat{\mathbf{w}}_{\mathit{LS}}\) is the estimated coefficient vector obtained from the no-intercept model, given in matrix form by

$$\displaystyle{ \hat{\mathbf{w}}_{\mathit{LS}} = \left (H^{{\prime}}H\right )^{-1}H^{{\prime}}y. }$$
(17)
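Putting equations (14)–(17) together, a hedged NumPy sketch of the ridge estimation step might look as follows: it computes the no-intercept least-squares fit, the HKB ridge parameter, and then the global ridge weights. It assumes H'H is invertible and is an illustration rather than the authors' implementation.

```python
import numpy as np

def ridge_hkb(H, y):
    """Global ridge weights (Eq. 14) with the HKB ridge parameter (Eq. 15)."""
    n, m = H.shape
    w_ls = np.linalg.solve(H.T @ H, H.T @ y)        # Eq. (17): no-intercept LS fit
    resid = y - H @ w_ls
    s2 = resid @ resid / (n - m + 1)                # Eq. (16) with k = m predictors
    lam = m * s2 / (w_ls @ w_ls)                    # Eq. (15): HKB estimate of lambda
    w_ridge = np.linalg.solve(H.T @ H + lam * np.eye(m), H.T @ y)   # Eq. (14)
    return w_ridge, lam
```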

4.2 RBF Neural Networks for Classification

The goal of classification is to assign observations to target categories or classes based on their characteristics in some optimal way. Thus, in the classification case, the outcome is one of a discrete set of possible classes rather than a continuous response as in nonparametric regression (Bishop 1995). However, we can make the classification problem look like nonparametric regression by incorporating a threshold function into the output neuron of the RBF-NN model.

For the binary dependent variable case, we can assign HRBF-NN predictions to class labels by substituting equation (1) into the threshold function \(t\left (f(w,H);t_{0}\right )\) given by

$$\displaystyle{ t\left (f(w,H)\right ) = \left \{\begin{array}{ll} 0&\quad f(w,x) <t_{0} \\ 1&\quad f(w,x)> t_{0} \end{array} \right. }$$
(18)

where \(t_{0}\) is the value separating the two classes. When the two classes have an equal number of observations, \(t_{0} = 0.5\).

Assuming that the classes are coded 0 and 1, with \(n_{1}\) and \(n_{2}\) observations in each class respectively, the threshold value is calculated as

$$\displaystyle{ t_{0} = \frac{n_{1}} {n_{1} + n_{2}}. }$$
(19)

The threshold value can be regarded as the prior probability of the first group, which equals 0.5 when the two groups have an equal number of observations.
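A small sketch of this thresholding step, corresponding to equations (18) and (19), is given below. It assumes the group coded 0 is the "first" group with \(n_{1}\) observations; names are illustrative.

```python
import numpy as np

def classify(f_pred, y_train):
    """Map continuous HRBF-NN outputs to 0/1 labels via Eqs. (18)-(19)."""
    n1 = np.sum(y_train == 0)          # first group, coded 0 (an assumption)
    n2 = np.sum(y_train == 1)          # second group, coded 1
    t0 = n1 / (n1 + n2)                # Eq. (19); equals 0.5 for balanced classes
    return (f_pred > t0).astype(int)   # Eq. (18)
```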

5 Information Theoretic Model Selection Criteria

In HRBF-NN, we use the ICOMP criterion of Bozdogan (1994, 2000, 2004) and Liu and Bozdogan (2004) as the fitness function to carry out variable selection with the GA. The complexity of a nonparametric regression model increases with the number of independent and adjustable parameters, also termed the effective degrees of freedom of the model. According to the qualitative principle of Occam’s razor, the simplest model that fits the observed data is the best model. Following this principle, we aim to provide a trade-off between how well the model fits the data and the model complexity (Akbilgic et al. 2013).

The derived forms of the information criteria are used to evaluate and compare different horizontal and vertical subset selections in the genetic algorithm (GA) for the regularized regression, classification tree, and RBF network model given in equation (1), under the assumption \(\varepsilon \sim N\left (0,\sigma ^{2}I\right )\), or equivalently \(\varepsilon _{i} \sim N\left (0,\sigma ^{2}\right )\) for \(i = 1,2,\ldots,n\).

The general form of ICOMP is an approximation to the sum of two Kullback–Leibler (KL) (Kullback and Leibler 1951) distances. For a general multivariate normal linear or nonlinear structural model, suppose \(C_{1}\left (\hat{\varSigma }_{\mathrm{model}}\right )\) is approximated by the complexity of the estimated IFIM, \(C_{1}\left (\hat{\mathcal{F}}^{-1}\left (\hat{\theta }\right )\right )\). Then, we define ICOMP(IFIM) as

$$\displaystyle{ \mathrm{ICOMP}(\mathrm{IFIM}) = -2\log L\left (\hat{\theta }\right ) + 2C_{1}\left (\hat{\mathcal{F}}^{-1}\left (\hat{\theta }\right )\right ), }$$
(20)

where \(C_{1}\left (.\right )\) is a maximal information theoretic measure of complexity of the estimated inverse Fisher information matrix (IFIM) of a multivariate normal distribution given by

$$\displaystyle{ C_{1}\left (\hat{\mathcal{F}}^{-1}\left (\hat{\theta }\right )\right ) = \frac{s} {2}\mathrm{log}\left (\frac{\mathrm{tr}\left (\hat{\mathcal{F}}^{-1}\left (\hat{\theta }\right )\right )} {s} \right ) -\frac{1} {2}\mathrm{log}\mid \hat{\mathcal{F}}^{-1}\left (\hat{\theta }\right )\mid, }$$
(21)

and where \(s = \mathrm{dim}\left (\hat{\mathcal{F}}^{-1}\right ) = \mathrm{rank}\left (\hat{\mathcal{F}}^{-1}\right )\). The estimated IFIM for the HRBF-NN model is given by

$$\displaystyle{ \widehat{\mathrm{Cov}}\left (\hat{w},\hat{\sigma }^{2}\right ) =\hat{ \mathcal{F}}^{-1} = \left [\begin{array}{cc} \hat{\sigma }^{2}\left (H^{{\prime}}H\right )^{-1} & 0 \\ 0 &\frac{2\hat{\sigma }^{4}} {n} \end{array} \right ], }$$
(22)

where

$$\displaystyle{ \hat{\sigma }^{2} = \frac{\left (y - H\hat{w}\right )^{{\prime}}\left (y - H\hat{w}\right )} {n}. }$$
(23)

Then, the definition of ICOMP(IFIM) in equation (20) becomes:

$$\displaystyle{ \mathrm{ICOMP}(\mathrm{IFIM}) = \mathit{n}\mathrm{log}\left (2\pi \right ) + n\mathrm{log}\left (\hat{\sigma }^{2}\right ) + n + 2C_{ 1}\left (\hat{\mathcal{F}}^{-1}\left (\hat{\theta }\right )\right ), }$$
(24)

where the entropic complexity is

$$\displaystyle\begin{array}{rcl} C_{1}\left (\hat{\mathcal{F}}^{-1}\left (\hat{\theta }_{m}\right )\right )& =& \frac{m + 1} {2}\mathrm{log}\left [\frac{\mathrm{tr}\,\hat{\sigma }^{2}\left (H^{{\prime}}H\right )^{-1} + \frac{2\hat{\sigma }^{4}} {n} } {m + 1} \right ] \\ & & -\frac{1} {2}\mathrm{log}\mid \hat{\sigma }^{2}\left (H^{{\prime}}H\right )^{-1}\mid -\frac{1} {2}\mathrm{log}\left (\frac{2\hat{\sigma }^{4}} {n} \right ).{}\end{array}$$
(25)

We can also define ICOMP for misspecified models given as follows:

$$\displaystyle\begin{array}{rcl} \mathrm{ICOMP}(\mathrm{IFIM})_{\mathrm{Misspec}}& =& -2\mathrm{logL}\left (\hat{\theta }\right ) + 2C_{1}\left (\widehat{\mathrm{Cov}}\left (\hat{\theta }\right )_{\mathrm{Misspec}}\right ) \\ & =& \mathit{n}\mathrm{log}\left (2\pi \right ) + n\mathrm{log}\left (\hat{\sigma }^{2}\right ) + n + 2C_{ 1}\left (\widehat{\mathrm{Cov}}\left (\hat{\theta }\right )_{\mathrm{Misspec}}\right ),{}\end{array}$$
(26)

where

$$\displaystyle{ \widehat{\mathrm{Cov}}\left (\hat{\theta }\right )_{\mathrm{Misspec}} =\hat{ \mathcal{F}}^{-1}\hat{\mathcal{R}}\hat{\mathcal{F}}^{-1} }$$
(27)
$$\displaystyle{= \left [\begin{array}{cc} \hat{\sigma }^{2}(H^{{\prime}}H)^{-1} & 0 \\ 0 &\frac{2\hat{\sigma }^{4}} {n} \end{array} \right ]\left [\begin{array}{cc} \frac{1} {\hat{\sigma }^{4}} H^{{\prime}}D^{2}H &H^{{\prime}}\mathbf{1}\,\frac{\mathit{Sk}} {2\hat{\sigma }^{3}} \\ \big(H^{{\prime}}\mathbf{1}\,\frac{\mathit{Sk}} {2\hat{\sigma }^{3}} \big)^{{\prime}}&\frac{(n-m)(\mathit{Kt}-1)} {4\hat{\sigma }^{4}} \end{array} \right ]\left [\begin{array}{cc} \hat{\sigma }^{2}(H^{{\prime}}H)^{-1} & 0 \\ 0 &\frac{2\hat{\sigma }^{4}} {n} \end{array} \right ]}$$

is a consistent estimator of the covariance matrix \(\mathrm{Cov}(\theta _{k}^{{\ast}})\), often called the sandwich or robust covariance estimator, since it is a correct covariance regardless of whether the assumed model is correct or not. Here Sk and Kt denote the estimated skewness and kurtosis of the residuals. When the model is correct, we have \(\hat{\mathcal{F}} =\hat{ \mathcal{R}}\), and the sandwich covariance reduces to the usual IFIM \(\hat{\mathcal{F}}^{-1}\) (White 1982). Note that this covariance matrix takes into account the presence of skewness and kurtosis, which is not possible with AIC (Akaike 1973) and other Akaike-type criteria such as Rissanen/Schwarz (MDL/SBC) (Rissanen 1978; Schwarz 1978). The derived forms of these criteria for the HRBF-NN model are:

$$\displaystyle\begin{array}{rcl} \mathrm{AIC}(m) = n\log (2\pi ) + n\log \left (\frac{(y - H\hat{w})^{^{{\prime}} }(y - H\hat{w})} {n} \right ) + n + 2(m + 1),& &{}\end{array}$$
(28)
$$\displaystyle\begin{array}{rcl} \mathrm{MDL/SBC}(m) = n\log (2\pi ) + n\log \left (\frac{(y - H\hat{w})^{^{{\prime}} }(y - H\hat{w})} {n} \right ) + n + m\log (n).& &{}\end{array}$$
(29)
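For concreteness, a hedged NumPy sketch of these criteria is given below, covering the correctly specified ICOMP(IFIM) of equations (21)–(24), its misspecification-resistant form of equations (26)–(27), and AIC and MDL/SBC of equations (28)–(29). Reading Sk and Kt as the sample skewness and kurtosis of the residuals, and the helper names, are assumptions; this is not the authors' code.

```python
import numpy as np

def c1_complexity(cov):
    """Maximal entropic complexity C1 of a covariance matrix (Eq. 21)."""
    s = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * s * np.log(np.trace(cov) / s) - 0.5 * logdet

def information_criteria(H, y, w_hat):
    """AIC (Eq. 28), MDL/SBC (Eq. 29), ICOMP(IFIM) (Eq. 24) and its
    misspecification-resistant form (Eq. 26) for a fitted HRBF-NN model."""
    n, m = H.shape
    resid = y - H @ w_hat
    sigma2 = resid @ resid / n                              # Eq. (23)
    lof = n * np.log(2 * np.pi) + n * np.log(sigma2) + n    # lack-of-fit term

    # Estimated IFIM (Eq. 22), block diagonal in (w, sigma^2)
    F_inv = np.zeros((m + 1, m + 1))
    F_inv[:m, :m] = sigma2 * np.linalg.inv(H.T @ H)
    F_inv[m, m] = 2.0 * sigma2**2 / n

    # Sandwich covariance (Eq. 27); Sk and Kt taken as sample skewness/kurtosis
    sk = np.mean(resid**3) / sigma2**1.5
    kt = np.mean(resid**4) / sigma2**2
    R = np.zeros((m + 1, m + 1))
    R[:m, :m] = (H.T * resid**2) @ H / sigma2**2            # H' D^2 H / sigma^4
    R[:m, m] = R[m, :m] = (H.T @ np.ones(n)) * sk / (2 * sigma2**1.5)
    R[m, m] = (n - m) * (kt - 1) / (4 * sigma2**2)
    cov_mis = F_inv @ R @ F_inv

    aic = lof + 2 * (m + 1)                                 # Eq. (28)
    sbc = lof + m * np.log(n)                               # Eq. (29)
    icomp = lof + 2 * c1_complexity(F_inv)                  # Eq. (24)
    icomp_mis = lof + 2 * c1_complexity(cov_mis)            # Eq. (26)
    return aic, sbc, icomp, icomp_mis
```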

6 Genetic Algorithm for Subset Selection

There are several standard techniques available for variable selection, such as forward selection, backward elimination, a combination of the two, or all-possible-subsets selection. Neither the forward nor the backward procedure can deal with collinearity among the predictor variables. A major criticism of forward, backward, and stepwise selection is that little or no theoretical justification exists for the order in which variables enter or exit the algorithm. Stepwise searching rarely finds the overall best model or even the best subsets of a particular size. Stepwise selection, at the very best, can only produce an “adequate” model.

All-possible-subsets selection is a fail-proof method, but it is not computationally feasible: it takes too much time to compute and it is costly. For 20 predictor variables, in the usual subset regression model, the total number of possible models we would need to evaluate is 2<sup>20</sup> = 1,048,576. For this reason, we use the genetic algorithm to carry out variable selection in HRBF-NN with ICOMP as the fitness function.

The genetic algorithm is a robust evolutionary optimization search technique with very few restrictions (David and Alice 1996). The GA treats information as a series of codes on a binary string, where each string represents a different solution to a given problem. It follows the principle of survival of the fittest, introduced by Charles Darwin. The algorithm searches for the optimum solution within a defined search space (Eiben and Smith 2010), and it has shown outstanding performance in finding optimal solutions to problems in many different fields (Akbilgic et al. 2013; Akbilgic 2011; Akbilgic and Bozdogan 2011).
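A schematic GA for subset selection is sketched below: each binary string flags which predictors enter the HRBF-NN model, and a user-supplied fitness callback returns ICOMP (to be minimized). The tournament selection, single-point crossover, and bit-flip mutation operators and the default parameters are generic illustrations, not the settings of Table 7.

```python
import numpy as np

def ga_subset_select(fitness, n_vars, pop_size=30, n_gen=50,
                     crossover_rate=0.7, mutation_rate=0.01, seed=0):
    """Schematic GA: each binary string flags which predictors enter the model;
    fitness(subset_mask) should return ICOMP (lower is better)."""
    rng = np.random.default_rng(seed)
    pop = rng.integers(0, 2, size=(pop_size, n_vars))
    for _ in range(n_gen):
        scores = np.array([fitness(ind) for ind in pop])
        # tournament selection: keep the better of two randomly drawn parents
        i, j = rng.integers(pop_size, size=(2, pop_size))
        parents = np.where((scores[i] < scores[j])[:, None], pop[i], pop[j])
        # single-point crossover between consecutive parents
        children = parents.copy()
        for k in range(0, pop_size - 1, 2):
            if rng.random() < crossover_rate:
                cut = rng.integers(1, n_vars)
                children[k, cut:], children[k + 1, cut:] = \
                    parents[k + 1, cut:].copy(), parents[k, cut:].copy()
        # bit-flip mutation
        flip = rng.random(children.shape) < mutation_rate
        pop = np.where(flip, 1 - children, children)
    scores = np.array([fitness(ind) for ind in pop])
    return pop[scores.argmin()]        # best subset found
```

In our setting, the fitness callback would fit the HRBF-NN on the flagged predictors and return its ICOMP value.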

7 A Numerical Example: Analysis of Credit Approval Data

In this section, we report our computational results on a credit approval data set, classifying customers into good/bad classes using our hybrid RBF-NN approach with regularization, the GA, and ICOMP(IFIM) as the fitness function.

Our modern world depends upon credit. Entire economies are driven by people’s ability to “buy now, pay later” (Anderson 2007). Therefore, credit approval is one of the most critical decisions in the banking industry, requiring solid risk analysis. Credit scoring systems were introduced almost 50 years ago to evaluate customers’ eligibility for credit approval based on historical and current information about the customers. This information can be numeric, such as income, age, and volume of previous credit history, as well as nominal-categorical, such as sex, race, type of criminal record, and so on, often in high dimensions.

Our credit approval data set is obtained from the UCI Machine Learning Repository (2013). The original version of the credit approval data set consists of 690 observations, including fifteen independent variables (six continuous and nine categorical) and one binary dependent variable. By excluding the observations with missing attributes, we reduced the data to 654 observations, representing 296 positive and 358 negative credit ratings. Because all nine categorical independent variables were coded with meaningless letters to protect the confidentiality of the data, we transformed them into numbers, \(1,2,3,\ldots\), based on the number of categories in each variable. The representation of the original data and its usage in our study are given in Table 2.
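The preprocessing just described might be reproduced along the following lines with pandas, assuming the UCI file crx.data with '?' marking missing attribute values; the positions of the categorical columns and the class coding are assumptions about the file layout, not taken from Table 2.

```python
import pandas as pd

# Load the UCI Credit Approval data; '?' marks missing attribute values.
df = pd.read_csv("crx.data", header=None, na_values="?")

# Drop observations with missing attributes, as described in the text.
df = df.dropna().reset_index(drop=True)

# Columns 0-14 are predictors, column 15 is the +/- credit decision.
# (Which columns are categorical is an assumption about the file layout.)
categorical_cols = [0, 3, 4, 5, 6, 8, 9, 11, 12]
for col in categorical_cols:
    df[col] = pd.factorize(df[col])[0] + 1     # recode letters as 1, 2, 3, ...

y = (df[15] == "+").astype(int)                # 1 = positive, 0 = negative rating
X = df.drop(columns=15).astype(float).values
print(X.shape, y.mean())
```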

Table 2 Usage of credit approval data in our analysis

We first analyzed the credit approval data via the HRBF-NN model separately for four different RBFs, Gaussian, Cauchy, Multi-Quadratic, and Inverse Multi-Quadratic, using the saturated model. The confusion matrices for the different RBFs are reported in Tables 3, 4, 5, and 6, and the corresponding ICOMP(IFIM)Misspec values are reported in the last column of Table 8. For simplicity, we write ICOMP for ICOMP(IFIM)Misspec in the remainder of this study. Note that the classification accuracy is calculated using equation (30). The reason we run the HRBF-NN model on the saturated model is to compare the results with those obtained after variable selection. The classification accuracy is defined by

$$\displaystyle{ \mbox{ Classification accuracy} = 100\frac{\mbox{ number of correctly classified observations}} {\mbox{ total number of observations}}. }$$
(30)
Table 3 Gaussian
Table 4 Cauchy
Table 5 MQ

Tables 3, 4, 5, and 6 show the high performance of the HRBF-NN model for classification of the credit data, approximately 90 %. We then ran variable selection on the credit data using the GA with ICOMP as the fitness function. The parameter setting of the GA is based on our previous studies of the HRBF-NN model (Akbilgic et al. 2013). Thus, we set our GA parameters as given in Table 7.

Table 6 Inverse MQ RBF
Table 7 Parameter setting of GA for variable selection

After finishing the first stage of the analysis for the saturated model and setting the GA parameters, we carried out variable selection for the credit data using the GA separately for the four different RBFs. Table 8 shows the selected variable subsets and the minimized ICOMP values under the selected subsets for the different RBFs. We also show the ICOMP values calculated earlier for the saturated model in Table 8 for comparison.

Table 8 Variable selection under different RBFs

It is noted from Table 8 that the ICOMP values for the selected subsets are significantly lower than those calculated for the saturated model. It is then important to check whether the lower ICOMP values correspond to simpler models that still give good classification accuracy. To show this, we ran the HRBF-NN model for all four RBFs with the corresponding best subsets given in Table 8. The confusion matrix and classification accuracy were calculated for each case, and the results are reported in Tables 9, 10, 11, and 12.

Table 9 Gaussian RBF
Table 10 Cauchy RBF
Table 11 MQ RBF
Table 12 IMQ RBF

The important results in Tables 9, 10, 11, and 12 show that variable selection within HRBF-NN allows us to reduce the dimension of the input variables without any loss in classification accuracy. Comparing the classification accuracy between the saturated model and the best subsets shows similar classification performance while the dimensionality is significantly reduced for the best subsets. According to Table 8, variable selection with the Gaussian RBF resulted in a subset of only five variables out of fifteen at which the ICOMP value is minimized. Note that the best subset selected for the Gaussian RBF even gives slightly better classification accuracy than the saturated model.

Finally, for comparison purposes, we carried out the usual logistic regression analysis, even though its assumptions are violated for this data set. Using stepwise variable selection, we obtained a classification accuracy of 87.1 % with nine predictors selected as the best, including the constant term. These nine predictors are: 0, 4, 5, 7, 8, 9, 10, 11, and 15. Note that this subset does not coincide with the variables 3, 6, 9, 10, and 14 obtained from our results.

8 Conclusions and Discussion

In this paper, we introduced a novel approach to supervised classification using the HRBF-NN model with ICOMP. Our study shows that the HRBF-NN model is a highly capable technique for hard classification problems, even when the data are a mixture of continuous and categorical variables. We demonstrated that the GA is a powerful optimization tool for selecting the best subset of predictors that discriminate between the classes or groups. HRBF-NN using ICOMP with the GA provides a flexible variable selection and, at the same time, a classification tool which gives better results than the full saturated model. With our approach we can now provide a practical method for choosing the best kernel basis RBF for a given data set, which was not possible before in the literature on RBF-based methods. In real-world applications, we frequently encounter data sets with hundreds or thousands of variables. Our results show that the HRBF-NN model is a very flexible procedure that can drastically reduce dimensionality without losing classification accuracy. In our example, we reduced the number of input variables from fifteen to five with even slightly better classification accuracy, around 91 %. As is well known, kernel-based supervised classification techniques such as support vector machines (SVMs) and multi-class SVMs have recently become popular. One problem that has not been addressed in the literature is that kernelization and supervised classification take place in a high-dimensional reproducing kernel Hilbert space (RKHS) and not in the original data space. The kernel-space mapping is not one-to-one and onto, and is not invertible back to the original data space because of the dot-product operations used in the kernel trick. This makes the practical interpretation of the results difficult, even though one can obtain good classification error rates.

The new HRBF-NN approach proposed in this paper overcomes the difficulties encountered in RKHS-type supervised classification and provides us with a flexible technique in the original data space. It combines classification trees, regularized regression, and the genetic algorithm (GA) with radial basis function (RBF) neural networks (NN), along with the information complexity ICOMP criterion as the fitness function, to carry out classification and, at the same time, selection of the best subset of predictors which discriminate between the classes.

Therefore, we believe our approach is a viable means of data mining and knowledge discovery via the HRBF-NN method.