1 Introduction

Recent years have seen exponential growth in the size of datasets, while the time available to process them keeps shrinking because users expect quick answers to their queries. With the advent of machine learning algorithms, this seemingly impossible task has become feasible for researchers. However, most real-world datasets contain a great deal of irrelevant or redundant information, which hampers the learning ability of machine learning algorithms and increases the processing time of the learning model. For these reasons, researchers have introduced pre-processing techniques that allow models to handle even noisy data with ease [1]. Before the datasets are fed to any machine learning model, these pre-processing techniques refine them based on some parameters in order to reduce the redundant and/or irrelevant information present therein.

In this regard, it is to be noted that the interpretation of information may vary across domains. In pattern classification, for example, the information set is defined by the features used to represent the different patterns in the feature space. One of the most widely used pre-processing techniques in pattern classification is Feature Selection (FS) [2], a method that refines a dataset by keeping only the important and relevant features. In this way, FS helps in overcoming some major drawbacks generally found in raw datasets. By removing irrelevant features and thereby reducing the dimension of the feature vector, FS makes the machine learning model more effective in terms of both performance (such as recognition or classification ability) and processing time. There are many ways to perform FS. The most basic method is Blind Search (BS) [3], which considers every possible feature combination to decide the optimal subset of features. This process has exponential time complexity and is therefore infeasible when the number of features in a dataset is large. As an improvement over BS, researchers have introduced various heuristic algorithms [4, 5]. Such algorithms generate a random subset of features and try to improve it over time to find an optimal solution, which greatly reduces the time required. However, heuristic approaches are more problem-specific [6] and often use greedy methods [7] to reach the optimal feature set. More advanced FS methods include meta-heuristic algorithms [8,9,10,11], which are generally problem independent in nature and do not exploit any problem-specific parameter. Population-based meta-heuristics, instead of working on a single subset, solve the FS problem by generating a set of subsets (called a population). Each subset in this process is known as a candidate solution to the FS problem.

FS methods are broadly classified into three categories, namely filter methods [12,13,14], wrapper methods [8, 9, 15] and embedded methods [16,17,18]. Filter methods do not consult any learning algorithm to select the relevant features; they rely instead on the intrinsic or statistical properties of the features. This reduces the time requirement of the FS process, but as it works without the supervision of a learning algorithm, it is generally unable to produce good results. Wrapper methods, on the other hand, take the help of a learning algorithm to find the optimal subset of features. Although wrapper methods need more time than filter methods, they generally produce better results. Embedded methods combine the advantages of both filter and wrapper methods: they consult both the intrinsic properties of the features and a learning algorithm to form the best candidate solutions. Even though an embedded method may seem like the best approach of all, it has its own drawbacks. If the trade-off between the filter and wrapper constituents of an embedded method is not established properly, it may adversely affect the overall FS model. For this reason, in this paper we propose a novel wrapper-based FS framework built on a popular evolutionary algorithm, the Genetic Algorithm (GA) [1, 19, 20], which can be used to solve any FS problem.

GA is a meta-heuristic algorithm conceptualised from Charles Darwin's theory of natural evolution. It imitates the process of natural selection, in which the fittest individuals are chosen to reproduce the offspring of the next generation, mirroring the way child chromosomes are created from parent chromosomes. During this reproduction procedure, GA applies two operations to the chromosomes, namely crossover and mutation. Through the formation of child chromosomes, GA passes the fitter genes on to subsequent generations. When applied to FS, the chromosomes become candidate solutions which undergo a series of crossover and mutation operations to build the optimal solution. The better solutions are passed to the next iteration and the weaker solutions are discarded, so that optimal solutions are obtained after some generations. It is to be noted that any FS/optimisation algorithm should take care of two important aspects of the feature space, namely exploitation and exploration, in order to generate the optimal solution. Here exploration means how thoroughly the whole search space is covered, while exploitation refers to the extent of refinement achieved in a particular portion of the search space. Though GA achieves very good exploration through the crossover operation, it lacks extensive exploitation. Mutation, being a small random change or perturbation of the chromosomes, gives GA some local search (exploitation) ability, but it cannot provide the full extent of exploitation the algorithm needs. We have used the Great Deluge Algorithm (GDA) [21] to achieve this perturbation of the candidate solutions. As the name suggests, it follows the analogy of a person climbing a hill who sometimes moves downwards, in anticipation of reaching a new maximum height, while avoiding getting wet as the water level rises from the bottom. This very concept is used here to obtain a better optimal solution in FS. We can think of classification accuracies as hills: the value of the accuracy represents the height of a hill, and the hill tops are the optimum values, so the greater the height, the higher the accuracy. When a GA chromosome reaches a certain hill top, it has arrived at a local maximum, which may not be the global one. To reach the hill of maximum height, the chromosome must descend from the current hill top and keep searching the neighbouring area. This job can be performed using GDA.

Deluge based Genetic Algorithm (DGA) focuses on solving the FS problem using a modified version of GA in which GDA acts as the modification scheme. Though GA is a very simple algorithm, its performance is almost equivalent to that of more complex meta-heuristics such as Ant Colony Optimisation (ACO) [8], Particle Swarm Optimisation (PSO) [9] and Simulated Annealing (SA) [5]. This motivates us to design an FS model which is simple in terms of computation yet improves the performance of basic GA.

2 Related work

Since its introduction to the research community, FS has become a widely accepted pre-processing technique owing to its ability to reduce redundancy and/or irrelevancy in datasets and its problem-independent nature. Researchers all over the world have proposed new algorithms or modified existing algorithms in the FS domain. In this section, we discuss how the use of GA in FS has evolved, along with a few applications of GA to real-world pattern classification problems. Applications of GDA to various real-world problems are also explored.

GA is one of the primary optimisation algorithms employed to perform FS in the literature. Due to its simplicity, it has been widely adopted in many fields of research. Leardi et al. introduced GA into the FS domain [22], and since then many modified versions of GA have been employed for FS. In [15], Yang et al. modified the fitness function to make GA multi-objective; the fitness function consisted of the classification accuracy of the feature subset and the cost of achieving that accuracy, with appropriate trade-offs between the two parameters used to form the solution. In [23], Huang et al. proposed a way to hybridise GA with Mutual Information (MI). Their method combined filter and wrapper approaches to perform FS and implemented a local search heuristic to remove the insignificant features selected by the initial random population. The chromosomes were ranked using a filter method, but the final performance was evaluated in a wrapper fashion: a conditional MI was computed between the candidate feature and the already-selected features as well as the classes, and during the local search the candidates were ranked according to these MI values. Classic GA is weak in fine-tuning near local optimum positions; to improve this capability, Oh et al. proposed a hybrid GA in [24] which, after the mutation stage, used a local search procedure that explored the search space until chromosomes had the desired number of selected features. Siedlecki et al. in [25] introduced the use of GA for FS in automatic pattern classifier design. In [20], a histogram-based GA was proposed to perform FS on Devanagari numeral datasets: GA was run a number of times, the results were combined into a histogram, and the features above the mean histogram value were selected. An application of GA to microarray data can be seen in [19].

In [21], Dueck et al. proposed a new optimisation heuristic named GDA. The experimental results showed that the algorithm was as good as Threshold Accepting (TA) [26] at optimisation. Baykasoglu et al. used GDA to solve constrained mechanical design problems in [27]; for the first time, they modified GDA to solve complex nonlinear optimisation problems by embedding eight different chaotic maps in its neighbourhood structure. GDA has been used several times to solve university timetabling problems efficiently. In [28], Obit et al. used an extended version of non-linear GDA for university timetabling; the algorithm incorporated tournament selection, mutation and a replacement strategy to modify GDA. McCollum et al. in [29] introduced a two-phase approach incorporating Extended-GD (EGD) and applied it to the datasets presented in the 2nd International Timetabling Competition (ITC2007). Their technique consists of a construction phase followed by an improvement phase, with EGD used to improve the solutions in the latter. The results obtained by this method were compared to those obtained by the winner of the competition. In 2011, Mafarja et al. applied a modified GDA to perform attribute reduction [30]: the search space was divided into three regions, and in each region the water level was updated based on the quality of the current solution. The procedure achieved decent results when applied to 12 benchmark datasets. In the field of FS, GDA has been used to improve the local search of various state-of-the-art algorithms, as in [31], where Badawi et al. used a Memetic Algorithm (MA), a hybridised version of GA and GDA, to perform fish classification.

Mutation causes only a small change in the chromosomes since the mutation probability is generally low. This leads to poor exploitation capability in GA, and the algorithm may get stuck in a local optimum. We attempt to overcome this problem by looking into the neighbouring area of a candidate feature subset: GDA may allow the acceptance of some solutions with lower fitness in order to reach a higher peak, i.e. a solution with better fitness. From the literature survey above, it can be concluded that although GA and GDA have each been used to solve various real-world problems, there have been very few attempts to hybridise the two for FS. This motivates us to introduce a new approach that unites them, in which GDA replaces mutation and allows a more thorough search of the neighbourhood, thereby improving the exploitation capability of GA.

3 Proposed methodology

This section contains a detailed explanation of our proposed method, DGA. As already mentioned, GA is a widely used optimisation algorithm with a constrained exploitation ability; this hurts accuracy because the search near local optima, i.e. the local search, is lacking. GDA is a method which allows the solutions to degrade to a certain extent, which ultimately helps to achieve a greater degree of local search. The basic idea, then, is not to restrict the exploitation of the method but to perform a better local search to reach the global maximum. This is why we have used GDA to increase the local searching capability of GA.

3.1 Genetic Algorithm

GA [15] incorporates the concept of chromosome manipulation during the transfer of genetic information from parent chromosomes to child chromosomes. GA mainly consists of four key steps, i.e. population creation, parent selection, crossover and mutation, which resemble the biological process of chromosome formation. At first, GA creates a population of chromosomes which represent the initial candidates for the optimal solution. After parent chromosomes are selected from the population, they undergo crossover to form child chromosomes. These child chromosomes are then mutated, which allows for the restoration of lost genes. After all these steps, the offspring which survive pass on their genetic information to the next generation. GA follows the same procedure in FS. Each candidate solution in GA is represented by a chromosome. Chromosomes are vectors of ‘0’s and ‘1’s, where a ‘1’ indicates that the corresponding feature is selected and a ‘0’ that it is discarded. Each time, two parent chromosomes are selected from the population. They undergo crossover to form two child chromosomes, which are then mutated. These child chromosomes replace the parent chromosomes if they have better fitness values; otherwise, their parents are carried over to the next generation.
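As a concrete illustration of this encoding, the short sketch below (our own illustration, not the authors' code) represents a chromosome as a binary mask over the features and applies the classic bit-flip mutation; the feature count and the mutation rates are assumed values.

```python
import numpy as np

rng = np.random.default_rng(42)

def random_chromosome(n_features):
    """A chromosome is a 0/1 vector; a 1 means the corresponding feature is selected."""
    return rng.integers(0, 2, size=n_features)

def mutate(chromosome, rate=0.01):
    """Classic GA mutation: flip each bit independently with a small probability."""
    flips = rng.random(chromosome.shape) < rate
    return np.where(flips, 1 - chromosome, chromosome)

parent = random_chromosome(10)
child = mutate(parent, rate=0.1)   # a higher rate is used here only to make a flip likely
print(parent)
print(child)
```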

3.2 GDA

In 1993, Dueck proposed a new local search technique known as GDA [21], which is based on a real-life analogy: a person climbing a hill sometimes needs to descend in order to reach a new height while avoiding getting wet in the rising rain water. The pseudo code of GDA is as follows:

(Pseudo code of GDA, presented as an algorithm figure in the original)

GDA allows degradation of the quality of the candidate solutions to a certain extent in order to achieve a new maximum which was not attainable otherwise. In our proposed method, we have utilised this property of GDA to perform local search efficiently.
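Since the pseudo code itself appears only as a figure, the sketch below reconstructs the idea of GDA from the description above for a maximisation problem; the single-bit-flip neighbour move, the "rain speed" of 0.01 and the iteration budget are our assumptions, not values taken from [21].

```python
import numpy as np

rng = np.random.default_rng(0)

def great_deluge(initial, quality, rain_speed=0.01, max_iters=200):
    """Accept a neighbouring solution only if its quality stays above the
    water level; the water level rises by `rain_speed` every iteration."""
    current = initial.copy()
    best = current.copy()
    best_q = quality(best)
    water_level = best_q               # start the water at the initial quality
    for _ in range(max_iters):
        neighbour = current.copy()
        neighbour[rng.integers(len(neighbour))] ^= 1   # flip one random bit
        q = quality(neighbour)
        if q > water_level:            # stay "above the water", even if q < best_q
            current = neighbour
            if q > best_q:
                best, best_q = current.copy(), q
        water_level += rain_speed      # the flood keeps rising
    return best

# toy quality: features near the front of the vector are worth more when selected
toy_quality = lambda c: float(np.dot(c, np.linspace(1.0, 0.1, len(c))))
print(great_deluge(rng.integers(0, 2, size=8), toy_quality))
```

Note that the acceptance condition allows moves worse than the best solution found so far, as long as they stay above the rising water level; this is exactly the controlled degradation described above.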

3.3 DGA

Our proposed method, DGA, combines GDA with GA in order to address the problems of basic GA. As mentioned earlier, GA consists of four sub-processes: population creation, parent selection, crossover and mutation. Mutation is responsible for local search in the system of solutions, but in certain cases the solutions generated by GA suffer from getting stuck in a local optimum. To avoid this problem, GDA is used to apply a stochastic perturbation to the child chromosomes produced by GA, which helps them circumvent a local maximum and reach the global maximum in the search space. The flowchart of the proposed model is shown in Fig. 1. The algorithm stops when the number of iterations exceeds the maximum number of iterations allowed or no change occurs in successive iterations. The maximum number of iterations is taken as 20.

Fig. 1
figure 1

Flowchart describing the step by step progress of Deluge based Genetic Algorithm

Thus, DGA consists of five steps.

  1. Population creation.

  2. Parent selection.

  3. Crossover.

  4. GDA.

  5. Child replacement.

The population creation is described in Sect. 3.3.1. The parent selection using the roulette wheel is discussed in Sect. 3.3.2. The operations of crossover and GDA are described in Sects. 3.3.3 and 3.3.4 respectively. The replacement of parent chromosomes in the population by their children is described in Sect. 3.3.5.

3.3.1 Population creation

At first, a population of random chromosomes is created. Chromosomes are vectors of ‘0’s and ‘1’s, where a ‘1’ represents a feature that is selected and a ‘0’ a feature that is not considered; each such chromosome therefore represents a feature subset. The main objective of the algorithm is to find the optimal chromosome, i.e. one that achieves high accuracy using a small number of features. The size of the population (number of chromosomes) is taken as 20.
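A minimal sketch of this initialisation step follows (population size 20 as stated above; the feature count of 30 is only an illustrative assumption).

```python
import numpy as np

rng = np.random.default_rng(1)

def create_population(pop_size, n_features):
    """Each row is a chromosome: a random 0/1 mask over the features."""
    return rng.integers(0, 2, size=(pop_size, n_features))

population = create_population(pop_size=20, n_features=30)
print(population.shape)   # (20, 30)
```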

3.3.2 Parent selection

After creating a population of chromosomes, GA selects pairs of parent chromosomes for the crossover operation. We have used the Roulette Wheel selection method [32] for this purpose. Each chromosome gets an amount of space on the roulette wheel proportional to its accuracy, so a chromosome with higher accuracy occupies a larger portion of the wheel. After the chromosomes are placed on the wheel, it is rotated; a random pointer indicates the selected chromosome when the wheel stops. As better chromosomes occupy larger portions of the wheel, their chances of selection are higher. Using this procedure, two parent chromosomes are selected, which then proceed to the crossover operation.
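A sketch of this selection step is given below; the fitness values stand for the classification accuracies of the chromosomes, and drawing two distinct indices in a single call is our simplification of repeated wheel spins.

```python
import numpy as np

rng = np.random.default_rng(2)

def roulette_wheel_select(fitness, n_parents=2):
    """Fitness-proportional selection: each chromosome's slice of the wheel is
    proportional to its accuracy, so fitter chromosomes are picked more often."""
    fitness = np.asarray(fitness, dtype=float)
    probabilities = fitness / fitness.sum()
    return rng.choice(len(fitness), size=n_parents, replace=False, p=probabilities)

accuracies = [0.72, 0.80, 0.65, 0.91]
print(roulette_wheel_select(accuracies))   # indices of the two selected parents
```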

3.3.3 Crossover

The selected parent chromosomes then undergo crossover. Uniform crossover [19] is used because of its relative superiority over single-point crossover [33]. With a probability of \(0.5\), each bit is swapped between the two parents, so that the child chromosomes end up as a uniform mixture of the genetic information of the two parents.
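A sketch of this operator under the stated swap probability of 0.5:

```python
import numpy as np

rng = np.random.default_rng(3)

def uniform_crossover(parent_a, parent_b, swap_prob=0.5):
    """Swap each gene between the two parents independently with probability 0.5."""
    swap = rng.random(parent_a.shape) < swap_prob
    child_a = np.where(swap, parent_b, parent_a)
    child_b = np.where(swap, parent_a, parent_b)
    return child_a, child_b

p1 = np.array([1, 1, 1, 1, 0, 0, 0, 0])
p2 = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(uniform_crossover(p1, p2))
```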

3.3.4 GDA

The child chromosomes formed after crossover are then passed through GDA to achieve local search. Each child chromosome undergoes a series of perturbations; a perturbed solution is kept only as long as its quality does not fall below the acceptable level, otherwise it is discarded, and the chromosome keeps searching locally around its current solution for a better one. This significantly enhances the exploitation of the search space.

3.3.5 Child replacement

Finally, if the child chromosomes surpass their parents in terms of classification accuracy, they replace the parents in the population and are carried over to the next generation; otherwise, the parents remain in the population. Note that the classification accuracy produced by the classifier is used as the fitness function.
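Putting the five steps together, one generation of DGA might look like the sketch below. This is our reconstruction from the description above, not the authors' code: `fitness`, `select`, `crossover` and `gda_search` stand for the fitness evaluation, roulette-wheel, uniform-crossover and GDA sketches of the previous subsections.

```python
def dga_generation(population, fitness, select, crossover, gda_search):
    """One DGA generation: select two parents, cross them over, refine the
    children with GDA (in place of mutation), and keep a child only if it
    beats the parent it would replace."""
    scores = [fitness(c) for c in population]
    i, j = select(scores)                               # roulette wheel selection
    child_a, child_b = crossover(population[i], population[j])
    child_a = gda_search(child_a, fitness)              # GDA replaces mutation
    child_b = gda_search(child_b, fitness)
    for parent_idx, child in ((i, child_a), (j, child_b)):
        if fitness(child) > scores[parent_idx]:         # child replacement
            population[parent_idx] = child
    return population
```

Following the stopping criteria given earlier in Sect. 3.3, an outer loop would repeat this generation at most 20 times and stop early if the population does not change between successive iterations.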

4 Results and analysis

This section contains the results obtained by DGA on some well-known datasets. We have also compared these results with some state-of-the-art algorithms, which further confirms the applicability of our proposed method. It should be noted that although the computational cost of DGA is greater than that of GA, it outperforms GA in terms of both accuracy and the selection of a smaller number of features, which justifies its use in real-life applications. The description of the datasets used here is given in Sect. 4.1, the classifier details are provided in Sect. 4.2, and the analysis of the results is reported in Sect. 4.3.

4.1 Dataset description

To test DGA, we have selected 15 datasets from the UCI repository [34]. These datasets can be classified into three categories based on the number of attributes or features present: small, medium and large. For our experimentation, we have taken 4 small, 8 medium and 3 large datasets. The descriptions of the datasets are provided in Table 1.

Table 1 Category-wise (small, medium, large) information on 15 UCI datasets used in the experimentation

4.2 Classifier description

As learning algorithms, we have used three popular classifiers, namely K-Nearest Neighbour (KNN) [35], Multi-Layer Perceptron (MLP) [36] and Support Vector Machine (SVM) [37], to compute the classification accuracy of the candidate solutions generated by DGA. The accuracy of a chromosome calculated using these classifiers is its fitness value.
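As an illustration of how the fitness of a chromosome could be evaluated with one of these classifiers, the following sketch scores a feature mask with KNN on a held-out split using scikit-learn. The Wine data, K = 5 and the 70/30 split are assumptions for the example only, not the parameters reported in Table 2.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def fitness(chromosome, X_train, X_test, y_train, y_test, k=5):
    """Fitness of a chromosome = KNN accuracy on the selected feature columns."""
    mask = chromosome.astype(bool)
    if not mask.any():                     # a chromosome that selects nothing is useless
        return 0.0
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train[:, mask], y_train)
    return clf.score(X_test[:, mask], y_test)

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
chromosome = np.random.default_rng(4).integers(0, 2, size=X.shape[1])
print(fitness(chromosome, X_tr, X_te, y_tr, y_te))
```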

These three classifiers were selected to show the performance of our proposed method in classification environments of varying complexity. The results show that our proposed method does not depend on the classification model, which makes DGA largely classifier independent.

We have varied the parameters of the classifiers to obtain the most suitable evaluation environment for the candidate solutions. The best parameter combinations of the classifiers are provided in Table 2.

Table 2 The optimal values obtained for the important parameters (K for KNN, number of neurons for MLP, Kernel for SVM) of the classifier models used in the experimentation over different datasets

4.3 Analysis of the outcomes

This section presents the results obtained by DGA and compares them with those of other well-known optimisation algorithms. The four algorithms selected for the comparison are basic GA [15], PSO [9], SA [5] and Histogram Oriented Multi-Objective Genetic Algorithm (HMOGA) [20]. The detailed experimentation results of the proposed work are given in Table 3, with the best accuracies marked in bold.

Table 3 Classification accuracy obtained after FS by DGA using 3 classifiers (KNN, MLP, SVM) over 15 datasets

KNN, due to its low computation time, has been used as the classifier for the comparison. To keep the comparison uniform, the other algorithms have also been evaluated using the KNN classifier. Table 4 contains the comparative results. It can be observed from the table that DGA performs better than the other methods in 9 cases out of 15. A result is considered best if its accuracy is the highest; if the accuracies are equal, the number of features acts as the tie-breaker (a lower value is better). From Table 4 it can also be observed that for the small datasets DGA performs best on only 1 dataset, while for the medium datasets it outperforms the contemporary algorithms in 6 out of 8 cases, and for the large datasets it achieves the best accuracy on 2 out of 3. Therefore, it can be concluded that DGA performs better as the number of features (feature dimension) increases.

Table 4 Comparison of accuracy obtained by proposed FS model DGA with that of GA, PSO, SA and HMOGA using KNN classifier

From the results and the corresponding comparison, we can see that DGA has outperformed the other optimisation algorithms on 9 out of 15 datasets, which demonstrates the applicability of DGA to FS.

Figures 2, 3, 4, 5, 6 and 7 present graphical comparisons of DGA with the other methods with respect to the number of selected features as well as accuracy over the small, medium and large datasets. Figures 2, 3 and 4 show the best accuracy obtained by the different FS algorithms over the small, medium and large datasets respectively. Similarly, Figs. 5, 6 and 7 show the percentage of features selected for the best solution by each model over the small, medium and large datasets.

Fig. 2
figure 2

Comparison of the classification accuracy obtained by DGA with that of GA, PSO and HMOGA over small category datasets

Fig. 3
figure 3

Comparison of the classification accuracy obtained by DGA with that of GA, PSO and HMOGA over medium category datasets

Fig. 4
figure 4

Comparison of the classification accuracy obtained by DGA with that of GA, PSO and HMOGA over large category datasets

Fig. 5
figure 5

Comparison of the percentage of features selected by DGA with that of GA, PSO and HMOGA over small category datasets

Fig. 6
figure 6

Comparison of the percentage of features selected by DGA with that of GA, PSO and HMOGA over medium category datasets

Fig. 7
figure 7

Comparison of the percentage of features selected by DGA with that of GA, PSO and HMOGA over large category datasets

5 Conclusion

GA is one of the most fundamental optimisation algorithms that has been applied to the FS problem over the years. Its simple yet robust nature has made it one of the most popular FS techniques. However, GA's limited exploitation capability affects the quality of its population. Our aim, therefore, was to build a hybridised model around GA while keeping it simple. We have chosen GDA to overcome the disadvantages of GA and created the hybridised model called DGA. GDA's local search strategy helps GA move around a local optimum and gradually reach the global optimum, making the overall model more efficient at finding a global optimum than basic GA. As GDA is also a very simple algorithm, the complexity of the overall model does not increase much. The comparison of the results of DGA on the selected UCI datasets with those of other heuristic, meta-heuristic and hybridised models demonstrates the applicability of our model. In the future, we plan to extend GDA's local search technique and use it in other methods which suffer from the same exploitation problem as GA. The use of DGA in more application-oriented FS tasks can also be explored.