1 Introduction

For the past two decades, many studies have focused on the classification problem in the field of data mining [1, 2]. The general goal of data mining is to extract knowledge from a large gamut of data, and it is important to bear in mind some desirable properties of the discovered knowledge: it should be both highly predictive and comprehensible. The relative importance of each of these properties, which can be considered quality criteria for evaluating discovered knowledge, depends strongly on several factors, such as the kind of data mining task being solved, the application domain, and the user. Since this work focuses on the classification task of data mining, it is important that the discovered knowledge have high predictive accuracy, even though in many cases comprehensibility tends to be more important than predictive accuracy.

Knowledge comprehensibility is usually important for at least two related reasons. First, the knowledge discovery process usually assumes that the discovered knowledge will be used to support a decision made by a human user. Second, if the discovered knowledge is not comprehensible to the user, he/she will not be able to validate it, hindering the interactive aspect of the knowledge discovery process, which includes knowledge validation and refinement. Accordingly, in addition to the predictive accuracy of the proposed method, equal importance is given to its comprehensibility. In this paper, we address the comprehensibility of the proposed method by reducing its architectural complexity. The architectural complexity of the functional link artificial neural network (FLANN) [3] is directly proportional to the number of features and to the number of functions available for expanding each feature value. To reduce this complexity, we first select a subset of features (i.e., feature selection [4]) and then apply the usual procedure of functional expansion and training by back propagation learning. The steps from selection to learning are accomplished by hybridizing FLANN with genetic algorithms (GAs) [5]; we therefore name the method hybrid FLANN (HFLANN).

Traditional statistical classification procedures such as discriminant analysis are built on Bayesian decision theory [6]. In these procedures, an underlying probability distribution must be assumed in order to calculate the posterior probability upon which the classification decision is made. One major limitation of statistical models is that they work well only when the underlying assumptions are satisfied. The effectiveness of these methods depends to a large extent on the various assumptions or conditions under which the models are developed. Users must have a good knowledge of both data properties and model capabilities before the models can be successfully applied.

Neural networks [7] have emerged as an important tool for classification. The vast recent research activity in neural classification has established that neural networks are a promising alternative to various conventional classification methods. ANNs are capable of generating complex mappings between the input and the output space, and thus these networks can form arbitrarily complex nonlinear decision boundaries.

Pao et al. [8] have shown that their proposed FLANN may be conveniently used for function approximation and can be extended to pattern classification, with a faster convergence rate and lower computational load than a multi-layer perceptron (MLP). The FLANN is basically a flat network: the hidden layer is removed, and hence the learning algorithm becomes very simple. The functional expansion effectively increases the dimensionality of the input vector, and hence the hyperplanes generated by the FLANN provide greater discrimination capability in the input pattern space. Although many types of neural networks can be used for classification [7], we choose the feed forward multi-layer network, or multi-layer perceptron (MLP), as a benchmark method for comparison: even though it has a complex architecture and a long training time, it is the most widely studied and used neural network for classification. In addition, we use the FLANN with gradient descent from our previous work [3] for comparison. Although FLANN with back propagation learning gives promising results, if the number of features and the number of functions used for expansion are large, the complexity of the architecture increases and the model becomes less comprehensible. Another important point is that no matter how intelligent the FLANN is, it will fail to predict unknown samples if it is applied to low quality data. Hence, to improve both the predictive capability of the FLANN and its comprehensibility, we hybridize it with genetic algorithms (GAs) and name the result HFLANN. This method not only has practical time complexity, but also achieves good performance.

Over the last years, the machine learning and data mining community has become increasingly aware of the need for statistical validation of results. This can be ascribed to the maturity of the area, the increasing number of real-life applications, and the availability of open algorithmic frameworks that make it easy to develop new algorithms or modify existing ones and compare them among themselves.

The rest of this paper is organized as follows. In Sect. 2, we briefly review the background material. Section 3 presents our proposed HFLANN for classification. In Sect. 4, we report the experimental studies together with a parametric and nonparametric statistical comparison with other classifiers, namely RBF and FLANN trained by back propagation learning. Section 5 concludes the article.

2 Background

In this section, we discuss the background material required for an understanding of the proposed method. The section is divided into five subsections: the basic working principle of genetic algorithms, a formal model of the functional link artificial neural network, the importance of feature selection, the classification task of data mining, and statistical tests for the comparison of classifiers.

2.1 Genetic algorithms

In this section, we review the functioning of genetic algorithms (GAs) [5]. GAs are stochastic search algorithms characterized by the fact that a number N of potential solutions (called individuals \( I_{k} \in \Omega \), where \( \Omega \) represents the space of all possible individuals) of the optimization problem simultaneously sample the search space. This population \( P = \{ I_{1} ,I_{2} , \ldots ,I_{N} \} \) is modified according to the natural evolutionary process: after initialization, selection \( S: I^{N} \to I^{N} \) and recombination \( R: I^{N} \to I^{N} \) are executed in a loop until some termination criterion is reached. Each run of the loop is called a generation, and P(t) denotes the population at generation t.

The selection operator is intended to improve the average quality of the population by giving individuals of higher quality a higher probability of being copied into the next generation. Selection thereby focuses the search on promising regions of the search space. The quality of an individual is measured by a fitness function f: P → R. Recombination and mutation change the genetic material in the population in order to reach new points in the search space. Figure 1 depicts the steps performed in a GA.

Fig. 1
figure 1

Flow diagram of genetic algorithms
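To make the loop in Fig. 1 concrete, the following is a minimal sketch of a generational GA over binary individuals. The operator choices (roulette-wheel selection, one-point crossover, bit-flip mutation) and the `fitness` interface are illustrative assumptions, not a prescription; the population size and probabilities follow the parameter setup of Sect. 4.2.

```python
import random

def genetic_algorithm(fitness, n_bits, pop_size=50, generations=1000,
                      p_crossover=0.7, p_mutation=0.02):
    """Minimal generational GA over binary individuals (cf. Fig. 1)."""
    # Initialization: a random population of bit strings
    pop = [[random.randint(0, 1) for _ in range(n_bits)]
           for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(ind) for ind in pop]
        # Selection: fitness-proportionate (roulette-wheel) sampling;
        # the small epsilon guards against an all-zero generation
        parents = random.choices(pop, weights=[s + 1e-9 for s in scores],
                                 k=pop_size)
        # Recombination: one-point crossover on consecutive pairs
        children = []
        for a, b in zip(parents[::2], parents[1::2]):
            if random.random() < p_crossover and n_bits > 1:
                cut = random.randrange(1, n_bits)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            children += [a[:], b[:]]
        # Mutation: flip each bit with a small probability
        for ind in children:
            for i in range(n_bits):
                if random.random() < p_mutation:
                    ind[i] ^= 1
        pop = children
    return max(pop, key=fitness)   # best individual of the final generation
```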

2.2 Functional link artificial neural networks

In general, the models used to solve complex classification problems are multi-layer neural networks. There are many algorithms to train neural network models. However, the models being complex in nature, no single algorithm can be claimed as the best for training under the different complexities of real-life problems. Depending on the complexity of the problem, the number of layers and the number of neurons in the hidden layers need to be changed. As the number of layers and the number of neurons in the hidden layers increases, training the model becomes more complex. Very often, different algorithms fail to train the model for a given problem set.

To overcome the complexities associated with multi-layer neural networks, a single-layer neural network can be considered as an alternative. However, the single-layer neural network, being linear in nature, very often fails to map complex nonlinear problems. The classification task in data mining is highly nonlinear in nature; therefore, solving such problems with a single-layer feed forward artificial neural network is almost impossible.

In order to bridge the gap between the linearity of the single-layer neural network and the highly complex and computationally intensive multi-layer neural network, the FLANN architecture with back propagation learning has been suggested for classification [3]. The FLANN uses a single-layer feed forward neural network and overcomes its linear mapping by functionally expanding the input vector. Figure 2 shows the simple architecture of our previously proposed FLANN with gradient descent.

Fig. 2
figure 2

a functional expansion of the input feature, b FLANN model for classification, c block representation of a

The given set of patterns is fed to the input layer and expanded in the hidden layer, and the weighted sum is then fed to the single neuron of the output layer. The weights are optimized by back propagation learning during training.

The set of functions considered for functional expansion may not always be suitable for mapping the nonlinearity of a complex task. In such cases, a few more functions may be added to the set used for expanding the input dataset. However, the dimensionality of many problems is itself very high, and increasing it much further may not be an appropriate choice. It is therefore advisable, and also a new research direction, to choose a small set of alternative functions that can map the problem to the desired extent with a significant improvement in output.

2.3 Feature selection

Feature selection is one of the most important preprocessing tasks of data mining and knowledge discovery in databases. It is obvious that the quality of the discovered knowledge strongly depends on the quality of the data being mined. No matter how intelligent a data mining algorithm is, it will fail to discover high quality knowledge if it is applied to low quality data. This has motivated the development of several feature selection algorithms. The main goal of feature selection is to select a subset of relevant features out of all available features of the data being mined.

In general, feature selection can be visualized as the selection of a subset of features that reduces the probability of misrecognition in the operational (classification) phase. A feature selection scheme based on the availability of a set of labeled samples from each of a predefined set of classes is referred to as feature selection in a supervised environment. In practice, however, one often comes across situations where the samples are unlabeled or at best imperfectly labeled. Feature selection is also very important for machine learning because of its potential for speeding up and reducing the cost of the subsequent stage of concept learning or instance classification and for improving the performance of the learned results. Therefore, how to select the optimal feature subset to describe a learning system is always regarded as a key technology in the domain of machine learning.

Furthermore, among the different categories of feature selection algorithms, the genetic algorithm (GA) [5] is a rather recent development. GA-based feature selection is essential for the following reason. Suppose there are m features in the data being mined. Then the total number of candidate feature subsets is \( 2^{m} \), so the size of the search space grows exponentially with the number of features.

The GA is biologically inspired and has many mechanisms mimicking natural evolution. It has a great deal of potential in scientific and engineering optimization and search problems. The pioneering work by Siedlecki and Sklansky [9] demonstrated evidence for the superiority of the GA compared to representative classical algorithms. Subsequently, many studies were published showing the advantages of GAs for feature selection. Other heuristic techniques like genetic programming [10] and PSO [11] have also been used synergistically to optimize both the feature subset and the classification accuracy, but in this work we hybridize the GA with FLANN to obtain a near optimal set of features with high classification accuracy.

2.4 Classification

The digital revolution has made digitized information easy to capture and fairly inexpensive to store. With the development of computer hardware and software and the rapid computerization of business, huge amounts of data have been collected and stored in databases, and they are growing at a phenomenal rate. As a result, traditional ad-hoc mixtures of statistical techniques and data management tools are no longer adequate for analyzing this vast collection of data.

Raw data is rarely of direct benefit. Its true value is predicated on the ability to extract information useful for decision support or exploration and understanding the phenomenon governing the data source. In most domains, data analysis was traditionally a manual process. One or more analysts would become intimately familiar with the data and, with the help of statistical techniques, provide summaries and generate reports. In effect, the analyst acted as a sophisticated query processor. However, such an approach rapidly breaks down as the size of data grows and the number of dimensions increases. When the scale of data manipulation, exploration and inferencing goes beyond human capacities, people look to computing technologies for automating the process.

All these have prompted the need for intelligent data analysis methodologies, which could discover useful knowledge from data. The term KDD [2] refers to the overall process of knowledge discovery in databases. Data mining is a particular step in this process, involving the application of specific algorithms for extracting patterns (models) from data. Supervised pattern classification is one of the important tasks of data mining.

Supervised pattern classification can be viewed as a problem of generating appropriate class boundaries that can successfully distinguish the various classes in the feature space. In real-life problems, the boundaries between classes are usually nonlinear. It is known that, using a number of hyperplanes, one can approximate any nonlinear surface. Hence, the problem of classification can be viewed as searching for a number of linear surfaces that appropriately model the class boundaries while minimizing the number of misclassified data points.

The goal of pattern classification is to assign input patterns to one of a finite number, M, of classes. In the following, it is assumed that input patterns consist of static input vectors x containing X elements, continuous-valued real numbers denoted \( x_{1}, x_{2}, \ldots, x_{X} \). The elements represent measurements of features selected to be useful for distinguishing between classes. Input patterns can be viewed as points in the multidimensional space defined by the input feature measurements. The purpose of a pattern classifier is to partition this multidimensional space into decision regions that indicate to which class an input belongs.

Application of a pattern classifier first requires selection of features that must be tailored separately for each problem domain. Features should contain information required to distinguish between classes, be insensitive to irrelevant variability in the input, and also be limited in number to permit efficient computation of discriminant functions and to limit the amount of training data required. Good classification performance requires selection of effective features and also selection of a classifier that can make good use of those features with limited training data, memory, and computing power. Following feature selection, classifier development requires collection of training and test data, and separate training and test or use phases. During the training phase, a limited amount of training data and a priori knowledge concerning the problem domain is used to adjust parameters and/or to learn the structure of the classifier. During the test phase, the classifier designed from the training phase is evaluated on new test data by providing a classification decision for each input pattern. Classifier parameters and/or structure may then be adapted to take advantage of new training data or to compensate for nonstationary inputs, variation in internal components, or internal faults. Further evaluations require new test data.

It is important to note that test data should never be used to estimate classifier parameters or to determine classifier structure. This will produce an overly optimistic estimate of the real error rate. Test data must be independent data that is only used to assess the generalization of a classifier, defined as the error rate on never-before-seen input patterns. One or more uses of test data, to select the best performing classifier or the appropriate structure of one type of classifier, invalidate the use of that data to measure generalization.

2.5 Statistical test for comparison of classifiers

One of the goals of this paper is to study the statistical tests that can be used for comparing two or more classifiers on multiple datasets. Assume that we have tested k different algorithms on N datasets. Let \( P_{i}^{j} \), \( 1 \le i \le N,\;1 \le j \le k \), be the performance score of the jth algorithm on the ith dataset. The task is to decide whether, based on the values of \( P_{i}^{j} \), the algorithms are statistically different or not (i.e., whether HFLANN is statistically different from FLANN and RBF).

In this section, we examine several well-known and less-known statistical tests, such as the paired t-test and the Wilcoxon signed ranks test [12], and study their suitability for our purpose in terms of what they actually measure and how safe the assumptions they make about the data are.

2.5.1 Paired t-test

A common way to test whether the difference between the results of two classifiers over various datasets is nonrandom is to compute a paired t-test, which checks whether the average difference in their performance over the datasets is significantly different from zero.

Let \( P_{i}^{1} \) and \( P_{i}^{2} \) be the performance scores of the two classifiers on the ith of N datasets. The paired t-test is computed as follows: construct the null hypothesis and follow these steps.

  1. Calculate the mean performance of each classifier over all the datasets, \( \bar{P}^{1} = \frac{1}{N}\sum\nolimits_{i = 1}^{N} {P_{i}^{1} } \) and \( \bar{P}^{2} = \frac{1}{N}\sum\nolimits_{i = 1}^{N} {P_{i}^{2} } \), and the mean difference \( \bar{P} = \bar{P}^{1} - \bar{P}^{2} \).

  2. Calculate the variances \( \sigma_{1}^{2} = \frac{1}{N}\sum\nolimits_{i = 1}^{N} {\left( {P_{i}^{1} - \bar{P}^{1} } \right)^{2} } \) and \( \sigma_{2}^{2} = \frac{1}{N}\sum\nolimits_{i = 1}^{N} {\left( {P_{i}^{2} - \bar{P}^{2} } \right)^{2} } \).

  3. Calculate the standard error of the difference between the two means, \( \sigma_{P} = \sqrt {\frac{1}{N}\left( {\sigma_{1}^{2} + \sigma_{2}^{2} } \right)} \).

  4. Calculate the required t-value, \( t\_value = {\tfrac{{\bar{P}}}{{\sigma_{P} }}} \).

Enter the t-table with 2(N − 1) degrees of freedom, choose the required level of significance (normally p = 0.05), and read the critical t-value. The null hypothesis is then accepted or rejected according to whether the test statistic supports it.
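As an illustration, the four steps above can be computed directly; the following is a minimal sketch, assuming the two-tailed critical value at 2(N − 1) degrees of freedom as used later in Sect. 4.5.

```python
import numpy as np
from scipy import stats

def paired_t_test(p1, p2, alpha=0.05):
    """t-test over N datasets, following steps 1-4 above."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    n = len(p1)
    p_bar = p1.mean() - p2.mean()              # step 1: mean difference
    var1 = ((p1 - p1.mean()) ** 2).mean()      # step 2: variances
    var2 = ((p2 - p2.mean()) ** 2).mean()
    sigma_p = np.sqrt((var1 + var2) / n)       # step 3: standard error
    t_value = p_bar / sigma_p                  # step 4: t statistic
    t_crit = stats.t.ppf(1 - alpha / 2, df=2 * (n - 1))
    return t_value, t_crit                     # significant if |t_value| > t_crit
```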

2.5.2 Wilcoxon signed ranks test

The Wilcoxon signed ranks test [12] is a nonparametric alternative to the paired t-test: it ranks the differences in the performances of two classifiers for each dataset, ignoring the signs, and compares the ranks for the positive and the negative differences.

Let \( P_{i} \) be the difference between the performance scores of the two classifiers on the ith of N datasets. The differences are ranked according to their absolute values; average ranks are assigned in case of ties. Let \( R_{\text{pos}} \) be the sum of ranks for the datasets on which the second algorithm outperformed the first, and \( R_{\text{neg}} \) the sum of ranks for the datasets on which the first algorithm outperformed the second. Ranks of \( P_{i} = 0 \) are split evenly among the two sums; if there is an odd number of them, one is ignored:

$$ R_{\text{pos}} = \sum\limits_{{P_{i} > 0}} {{\text{rank}}(P_{i} )} + {\frac{1}{2}}\sum\limits_{{P_{i} = 0}} {{\text{rank}}(P_{i} )} $$
(1)
$$ R_{\text{neg}} = \sum\limits_{{P_{i} < 0}} {{\text{rank}}(P_{i} )} + {\frac{1}{2}}\sum\limits_{{P_{i} = 0}} {{\text{rank}}(P_{i} )} $$
(2)

Let \( R_{\text{s}} \) be the smaller of the sums, \( R_{\text{s}} = \min \left\{ {R_{\text{pos}} ,R_{\text{neg}} } \right\} \). For a large number of datasets, the statistic \( z = \left( {R_{\text{s}} - \tfrac{1}{4}N(N + 1)} \right)\big/\sqrt {\tfrac{1}{24}N(N + 1)(2N + 1)} \) is distributed approximately normally. With \( \alpha = 0.05 \), the null hypothesis can be rejected if z is smaller than −1.96.
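A minimal sketch of Eqs. 1 and 2 and the z statistic follows. For reference, `scipy.stats.wilcoxon` with `zero_method='zsplit'` implements the same zero-splitting convention, but the explicit computation keeps the rank sums visible.

```python
import numpy as np
from scipy import stats

def wilcoxon_signed_ranks(p1, p2):
    """Rank sums of Eqs. 1-2 and the normal-approximation z statistic."""
    d = np.asarray(p1, float) - np.asarray(p2, float)
    ranks = stats.rankdata(np.abs(d))          # average ranks in case of ties
    r_pos = ranks[d > 0].sum() + 0.5 * ranks[d == 0].sum()
    r_neg = ranks[d < 0].sum() + 0.5 * ranks[d == 0].sum()
    r_s = min(r_pos, r_neg)
    n = len(d)
    z = (r_s - n * (n + 1) / 4.0) / np.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    return r_pos, r_neg, z      # reject the null at alpha = 0.05 if z < -1.96
```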

The Wilcoxon signed ranks test is more sensible than the t-test. It assumes commensurability of differences, but only qualitatively: greater differences still count more, which is probably desired, but their absolute magnitudes are ignored. From the statistical point of view, the test is safer since it does not assume Gaussian distributions. Also, outliers have less effect on the Wilcoxon test than on the t-test.

The Wilcoxon test assumes continuous differences \( P_{i} \); therefore, they should not be rounded to, say, one or two decimals since this would decrease the power of the test due to a high number of ties.

When the assumptions of paired t-test are met, the Wilcoxon signed ranks test is less powerful than the paired t-test. On the other hand, when the assumptions are violated, the Wilcoxon test can be even more powerful than the t-test.

3 Proposed method

The proposed HFLANN is a single hidden-layer artificial neural network (ANN) with a genetically optimized set of features. It is capable of generating complex decision regions through the nonlinear enhancement of hidden nodes referred to as functional links. Figure 3 shows the topological structure of the HFLANN. The proposed method is characterized by a set of FLANNs, each with a different subset of features. The initial number of inputs of the network is the same as the number of input variables of the data domain.

Fig. 3
figure 3

Topological structure of the HFLANN

Let n be the number of original features of the data domain. The number of features selected to form a chromosome of the genetic population is d, \( d \le n \). The value of d varies from chromosome to chromosome of the genetic population (i.e., \( 1 \le d \le n \)). For simplicity, let us see how a single chromosome with d features works cooperatively within HFLANN.

In this work, we use general trigonometric functions to map the d features from one form to another of higher dimension. One could instead use a function that is very close to the underlying distribution of the data, but this requires some prior domain knowledge. Here, we take five functions, of which four are trigonometric and one is linear (i.e., keeping the original form of the feature value). Of the four trigonometric functions, two are sine and two are cosine functions. For the trigonometric functions, the domain is the feature values and the range is a real number in [−1,1]. This can be written as

$$ f:D \to R^{{[ - 1,1] \cup \{ x\} }} $$
(3)

where \( D = \{ x_{i1} ,x_{i2} , \ldots ,x_{id} \} \), and d is the number of selected features.

In general, let \( f_{1} ,f_{2} , \ldots ,f_{k} \) be the k functions used to expand each feature value of a pattern. Each input pattern can then be expressed as

$$ \vec{x}_{i} = \{ x_{i1} ,x_{i2} , \ldots ,x_{id} \} \to \{ \{ f_{1} (x_{i1} ),f_{2} (x_{i1} ), \ldots ,f_{k} (x_{i1} )\} , \ldots ,\{ f_{1} (x_{id} ),f_{2} (x_{id} ), \ldots ,f_{k} (x_{id} )\} \} = \{ \{ y_{11} ,y_{21} , \ldots ,y_{k1} \} , \ldots ,\{ y_{1d} ,y_{2d} , \ldots ,y_{kd} \} \} , $$
(4)

The weight vector between the hidden layer and the output layer is multiplied by the resulting set of nonlinear outputs, and the products are fed to the output neuron as its input. The weighted sum is computed as follows:

$$ s_{i} = \sum\limits_{j = 1}^{m} {y_{ij} w_{j} } ,\quad i = 1,2, \ldots ,X $$
(5)

where m is the total number of expanded features and X is the number of input patterns.
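For illustration, a sketch of the expansion of Eq. 4 and the weighted sum of Eq. 5 follows. The exact arguments of the trigonometric functions are not fixed by the text; sin(πx), sin(2πx), cos(πx), and cos(2πx) are assumed here as a commonly used FLANN basis.

```python
import numpy as np

# Five expansion functions: the original value plus two sine and two cosine
# terms (an assumed, commonly used FLANN basis; k = 5).
BASIS = [lambda x: x,
         lambda x: np.sin(np.pi * x), lambda x: np.sin(2 * np.pi * x),
         lambda x: np.cos(np.pi * x), lambda x: np.cos(2 * np.pi * x)]

def expand(x):
    """Eq. 4: map a d-dimensional pattern to a (k*d)-dimensional one."""
    return np.concatenate([np.array([f(v) for f in BASIS]) for v in x])

def weighted_sum(x, w):
    """Eq. 5: s_i = sum_j y_ij * w_j, fed to the output neuron."""
    return expand(x).dot(w)
```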

The network learns through back propagation. Training requires a set of training data, i.e., a series of input vectors and associated output vectors. During training, the network is repeatedly presented with the training data, and the weights are adjusted by back propagation learning until the desired input–output mapping occurs.

Hence, the estimated output is computed as follows:

$$ \hat{y}_{i} (t) = f(s_{i} ),\quad i = 1, 2, \ldots ,X. $$

Let \( e_{i} (t) = y_{i} (t) - \hat{y}_{i} (t),\quad i = 1, 2, \ldots ,X \) be the error obtained for the ith pattern of the training set. The error criterion function can then be written as

$$ E(t) = \sum\limits_{i = 1}^{X} {e_{i}^{2} (t)} $$
(6)

and our objective is to minimize this function by a gradient descent approach until \( E \le \varepsilon \).
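A sketch of one training pass is given below, assuming a sigmoid output nonlinearity f and a learning rate η (neither is specified above); the weight update is the standard delta rule for the squared-error criterion of Eq. 6.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def train_epoch(X_exp, y, w, eta=0.1):
    """One gradient-descent pass over the functionally expanded patterns.

    X_exp: (X, m) matrix of expanded patterns (Eq. 4); y: targets in [0, 1];
    w: weight vector of Eq. 5. Returns the updated weights and E(t) of Eq. 6."""
    E = 0.0
    for x_i, y_i in zip(X_exp, y):
        s_i = x_i.dot(w)                 # weighted sum, Eq. 5
        y_hat = sigmoid(s_i)             # estimated output
        e_i = y_i - y_hat                # per-pattern error
        w = w + eta * e_i * y_hat * (1.0 - y_hat) * x_i   # delta rule
        E += e_i ** 2                    # error criterion, Eq. 6
    return w, E
```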

This process is repeated for each chromosome of the GA, and then, based on performance, each chromosome is assigned a fitness value. Using these fitness values, the usual GA process is executed until a good topology with high predictive accuracy is achieved.

3.1 High level algorithms for HFLANN

The specification of the near optimal HFLANN architecture and related parameters is obtained by combining genetic algorithms and back propagation learning, as explained in the following. Evolutionary algorithms of the genetic type are stochastic search and optimization methods principally based on computational models of fundamental evolutionary processes, such as reproduction, recombination, and mutation. An algorithm of this type begins with a set (population) of estimates (genes), called individuals (chromosomes), appropriately encoded. Each one is evaluated for its fitness in solving the classification task. During each iteration (algorithm time-step), the fittest individuals are allowed to mate and bear offspring.

3.1.1 Individual representation

For the evolutionary process, the length of each individual is n (i.e., the upper bound of the feature vector). Figure 4 shows the structure of a chromosome used in the design of HFLANN. Each cell of the chromosome contains a binary value, 0 or 1. The cell value controls the activation (value 1) or deactivation (value 0) of the corresponding feature's functional expansion.

Fig. 4
figure 4

Individual representation with its associated FLANN topology
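In code, a chromosome is simply a boolean mask over the columns of the data matrix; the example mask below is hypothetical.

```python
import numpy as np

# Hypothetical chromosome for a 5-feature dataset: features 1, 2, and 5 active
chromosome = np.array([1, 1, 0, 0, 1], dtype=bool)

def select_features(X, chromosome):
    """Keep only the feature columns whose chromosome bit is 1 (Fig. 4)."""
    return X[:, chromosome]        # d = chromosome.sum() selected features
```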

3.1.2 Objective function

During evolution, each individual measures its effectiveness by the error criterion function of Eq. 6, and the resulting predictive accuracy is assigned as its fitness.

The major steps of HFLANN can be described as follows:
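Read together with the sketches of Sects. 2.1 and 3, the main loop can be outlined as below. This is our reconstruction from the description in this section, not the authors' exact listing; `genetic_algorithm`, `expand`, `train_epoch`, `sigmoid`, and `select_features` refer to the earlier sketches, and a binary (0/1) target is assumed.

```python
import numpy as np

def hflann_fitness(X_train, y_train, epochs=100, eta=0.1):
    """Wrap Sect. 3.1.2's objective as a GA fitness function: train a FLANN
    on the features activated by the chromosome and return its accuracy."""
    def fitness(chromosome):
        mask = np.asarray(chromosome, dtype=bool)
        if not mask.any():                    # an all-zero mask selects nothing
            return 0.0
        Xe = np.apply_along_axis(expand, 1, X_train[:, mask])   # Eq. 4
        w = np.zeros(Xe.shape[1])
        for _ in range(epochs):               # back propagation learning
            w, _ = train_epoch(Xe, y_train, w, eta)
        y_hat = (sigmoid(Xe.dot(w)) > 0.5).astype(float)
        return float((y_hat == y_train).mean())   # predictive accuracy
    return fitness

# The GA of Sect. 2.1 then evolves the feature masks with this fitness, e.g.:
# best_mask = genetic_algorithm(hflann_fitness(X, y), n_bits=X.shape[1])
```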

Looked at closely, this algorithm not only selects an optimal set of features but also evolves a set of FLANN architectures; it can therefore be regarded as a type of evolving FLANN. However, in this work we do not optimize the architecture in all its aspects (i.e., topological structure as well as weights). Hence, instead of multi-objective optimization, we optimize a single objective, the predictive accuracy of the HFLANN.

4 Experimental studies

The performance of the HFLANN model was evaluated using a set of ten public domain datasets, IRIS, WINE, PIMA, BUPA Liver Disorders, ECOLI, GLASS, HOUSING, LED7, LYMPHOGRAPHY, and ZOO, from the University of California at Irvine (UCI) machine learning repository [13]. In addition, we use the VOWEL dataset to show the performance of HFLANN in classifying six overlapping vowel classes [14]. We compare the results of HFLANN with those of competing classification methods, namely the radial basis function network (RBF) and our previously proposed FLANN with gradient descent.

This section is divided into five subsections. Section 4.1 discusses the nature and characteristics of the datasets being classified. Section 4.2 discusses the parameter setup required for the experiments. The comparative performance of the model is demonstrated in Sect. 4.3 with a discussion. The classification performance of HFLANN with chromosome knowledge incorporation is presented in Sect. 4.4. Finally, the statistical tests are analyzed in Sect. 4.5.

4.1 Description of the datasets

Let us briefly discuss the datasets taken for our experimental setup.

IRIS Dataset: This is the most popular and simple classification dataset, based on the multivariate characteristics of a plant species (length and thickness of its petal and sepal) divided into three distinct classes (Iris Setosa, Iris Versicolor, and Iris Virginica) of 50 instances each. One class is linearly separable from the other two; the latter two are not linearly separable from each other. In a nutshell, it has 150 instances and 5 attributes, of which four are predicting attributes and one is the goal attribute. All the predicting attributes are real valued.

WINE Dataset: This dataset results from a chemical analysis of wines grown in the same region of Italy but derived from three different cultivars. In the classification context, this is a well-posed problem with well-behaved class structures. The total number of instances is 178, distributed as 59 in class 1, 71 in class 2, and 48 in class 3. The number of attributes is 14 including the class attribute, and all 13 predicting attributes are continuous. There are no missing attribute values in this dataset.

PIMA Indians Diabetes Database: This database is a collection of records of female patients of at least 21 years of age of PIMA Indian heritage. It contains 768 instances, 2 classes (positive and negative), and 9 attributes including the class attribute. The attributes contain either integer or real values. There are no missing attribute values in the dataset.

BUPA Liver Disorders: This dataset relates to the diagnosis of liver disorders and was created by BUPA Medical Research, Ltd. It consists of 345 records and 7 attributes including the class attribute, which takes only two values over the entire database. The first 5 attributes are all blood tests thought to be sensitive to liver disorders that might arise from excessive alcohol consumption. Each record corresponds to a single male individual.

ECOLI: This dataset describes protein localization sites. It contains 336 instances and 7 predictive attributes with no missing values, plus one class attribute. The samples are distributed into 8 classes, and the class distribution is highly unbalanced.

GLASS: The glass identification dataset contains 214 instances and 11 attributes (including an Id#) plus the class attribute (whose domain contains 6 values). All the attribute values are continuous, and none contains missing values.

VOWEL: This dataset consists of 871 patterns with 6 overlapping vowel classes and three input features. All entries are integers.

HOUSING: The Boston housing data concerns housing values in the suburbs of Boston. There are 506 samples, 13 continuous attributes (including the class attribute), and 1 binary valued attribute, with no missing values.

LED7: The LED display dataset contains 7 attributes, and the user chooses the number of instances. No attribute contains missing values.

LYMPHOGRAPHY: This dataset contains 148 instances and 19 attributes (including the class attribute) with no missing values. The instances are distributed into 4 classes.

ZOO: This dataset contains 101 instances of zoo information. The number of attributes is 18 (animal name, 15 Boolean attributes, and 2 numeric attributes). There are no missing values in the attribute domains, and the instances fall into 7 types.

Table 1 presents a summary of the main characteristics of the databases used in this study. The second column of this table gives the dataset name, while the other columns indicate, respectively, the number of instances, the number of attributes, and the number of classes.

Table 1 Summary of the dataset used in simulation studies

4.2 Parameter setup

For evaluating the proposed algorithm, the following user defined parameters and protocols related to the dataset need to be set beforehand.

A twofold cross validation is carried out for every dataset by randomly dividing it into two parts (dataset1.dat and dataset2.dat). Each of these two sets is alternately used either as the training set or as the test set.
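A minimal sketch of this protocol (the random seed and the handling of an odd instance count are our assumptions):

```python
import numpy as np

def twofold_split(X, y, seed=0):
    """Randomly split the data into dataset1 and dataset2; each half is used
    once as the training set and once as the test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    half = len(X) // 2                 # any odd instance goes to part 2
    i1, i2 = idx[:half], idx[half:]
    return (X[i1], y[i1]), (X[i2], y[i2])
```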

The quality of each individual is measured by the predictive performance obtained during training. It is also very important to set optimal values of the following parameters to reduce the risk of convergence to local optima. The parameters are as follows:

Population size: The size of the population, |P| = 50, is fixed for all the datasets. We chose 50 to avoid under- and over-fitting during training: the larger the number of individuals, the more computation time is required, and the slower the system becomes.

Stopping criterion: The number of iterations is fixed at 1,000 for all the datasets.

Length of individuals: The length of each individual is fixed at n, where n is the number of input features. The crossover probability is 0.7 and the mutation probability is 0.02.

4.3 Comparative performance

The predictive performance of HFLANN on the datasets described earlier was compared with the results obtained from FLANN with back propagation learning and from the radial basis function network (RBF). Table 2 summarizes the average training and test performance of HFLANN compared with FLANN and RBF.

Table 2 Average comparative performance of HFLANN, FLANN, and RBF

From Table 2, we can easily verify that, except in the BUPA case, the proposed method gives promising results on average in both the training and test cases for all datasets. In the case of BUPA, FLANN performs better. Table 3 illustrates a fair comparison of the proposed algorithm using the maximum predictive accuracy obtained on the training and test sets.

Table 3 Comparative performance w.r.t maximum training performance

Table 4 shows the percentage of relevant features selected for each of the datasets during training of HFLANN.

Table 4 Percentage of feature selected

Figure 5 gives a graphical view of the percentage of features selected. The X-axis represents the datasets, and the Y-axis represents the percentage of active bits in the optimal chromosome obtained during training.

Fig. 5
figure 5

Percentage of optimal set of selected features

4.4 Knowledge incorporation in the measure of predictive accuracy

Let n be the total number of features in the dataset; let \( T_{1} \) denote the set of features selected using training set 1 and test set 2, and \( T_{2} \) the set of features selected using training set 2 and test set 1.

Notations and their Meaning:

  • |n| represents the total number of features in the dataset.

  • \( |T_{1} | \) denotes the total number of selected features in test set 2.

  • \( |T_{2} | \) denotes the total number of selected features in test set 1.

The fitness of the chromosome with respect to \( T_{1} \) is

$$ f(T_{1} ) = {\tfrac{{|PA| \times |n| - \tau \times |T_{1} |}}{|n|}} $$
(7)

Similarly, the fitness of the chromosome with respect to \( T_{2} \) is

$$ f(T_{2} ) = {\tfrac{{|PA| \times |n| - \tau \times |T_{2} |}}{|n|}} $$
(8)

In general, we write

$$ f(T) = |PA| - {\tfrac{\tau \times |T|}{|n|}} $$
(9)

where |PA| represents the predictive accuracy, and τ represents the tradeoff between the two criteria (predictive accuracy and subset size); its value is set to 0.01.
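Equation 9 is straightforward to compute; a small sketch with a hypothetical example follows.

```python
def knowledge_fitness(accuracy, n_selected, n_total, tau=0.01):
    """Eq. 9: penalize predictive accuracy |PA| by the fraction of features kept."""
    return accuracy - tau * n_selected / n_total

# Hypothetical example: 95% accuracy using 4 of 13 features
# knowledge_fitness(0.95, 4, 13)  ->  0.9469
```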

Table 5 shows the predictive accuracy of HFLANN computed using Eqs. 7 and 8, incorporating this knowledge of each optimally selected chromosome with respect to test set 1 and test set 2.

Table 5 Predictive accuracy of HFLANN by knowledge incorporation with τ = 0.01

Table 6 shows the hit percentage on training set 1 and training set 2 and the corresponding number of active bits of each individual. From this table, one can judge whether explicit knowledge incorporation is important for the classifier or not.

Table 6 Performance in training set 1 and training set 2

Figure 6 shows the predictive performance of HFLANN incorporating individual knowledge with \( \tau = 0.01 \) for training set 1 and training set 2, with respect to the corresponding active bits of the individuals.

Fig. 6
figure 6

HFLANN performance on training set 1 and training set 2 with respect to the individual active bits

4.5 Paired t-test and Wilcoxon signed ranks test

We tested the proposed method (HFLANN) against FLANN and RBF using the t-test separately on the training and test performance scores. To test the significance of our algorithm against FLANN and RBF, we first construct the null hypothesis: there is no difference between the average performance of HFLANN versus FLANN and of HFLANN versus RBF.

4.5.1 HFLANN versus FLANN

Null hypothesis: means are equal.

t_value = 0.3833 with 20 degrees of freedom.

The chosen level of significance is 0.05, and the tabulated value is 2.09. As the calculated t_value is less than the tabulated value, we cannot reject the null hypothesis, i.e., the paired t-test does not show a statistically significant difference between HFLANN and FLANN.

4.5.2 HFLANN versus RBF

Null hypothesis: means are equal.

t_value = 1.4751 with 20 degrees of freedom.

The chosen level of significance is 0.05, and the tabulated value is 2.09. As the calculated t_value is again less than the tabulated value, we cannot reject the null hypothesis, i.e., the paired t-test does not show a statistically significant difference between HFLANN and RBF either. This motivates the nonparametric analysis below.

4.5.3 Wilcoxon signed ranks test

As with the paired t-test, we test the proposed method HFLANN against FLANN and RBF separately, because the Wilcoxon test compares two algorithms at a time over multiple datasets. Here, we try to reject the null hypothesis that both algorithms perform equally well. The ranks are assigned from the lowest to the highest absolute difference, and equal differences are assigned average ranks. Tables 7 and 8 show the classification performance of HFLANN versus FLANN and of HFLANN versus RBF, respectively, and the corresponding ranks on the training set.

Table 7 Performance score and ranks of HFLANN versus FLANN
Table 8 Performance score and ranks of HFLANN versus RBF

For HFLANN versus FLANN (Table 7), the sum of the ranks for the positive differences is \( R_{\text{pos}} = 65 \) and the sum of the ranks for the negative differences is \( R_{\text{neg}} = 1 \). According to the table of exact critical values for the Wilcoxon test, for a confidence level of \( \alpha = 0.05 \) and N = 11 datasets, the difference between the classifiers is significant if the smaller of the sums is equal to or less than 11. We therefore reject the null hypothesis.

For HFLANN versus RBF (Table 8), the sum of the ranks for the positive differences is \( R_{\text{pos}} = 66 \) and the sum of the ranks for the negative differences is \( R_{\text{neg}} = 0 \). Again, the smaller sum is below the critical value of 11, so we reject the null hypothesis.

Hence, we can conclude that in both cases the performance differences are significant, i.e., under the Wilcoxon signed ranks test the proposed algorithm is significantly better than its competitors.

5 Conclusions

In this paper, we have evaluated the proposed method, HFLANN, on the classification task of data mining, giving equal importance to the selection of an optimal set of features and to classification accuracy. The HFLANN model functionally maps the selected set of feature values from a lower to a higher dimension. The experimental studies demonstrate that the classification performance of the HFLANN model is promising: in almost all cases, the results obtained with HFLANN proved better than the best results found by its competitors, RBF and FLANN with back propagation learning. Further, we theoretically and empirically analyzed parametric (paired t-test) and nonparametric (Wilcoxon signed ranks test) procedures for comparing classifiers over multiple datasets. The architectural complexity of HFLANN is low, whereas its training time is somewhat higher than that of FLANN. One of the most important criteria in data mining is how comprehensible the model is: as architectural complexity increases, comprehensibility decreases. From this perspective, we can claim that the proposed model fits the data mining task of classification.