1 Introduction

Feature selection is a challenging task that has been the subject of a large amount of research, especially in relation to classification tasks. It eliminates redundant attributes and enhances classification accuracy by keeping only the relevant ones. However, feature selection requires an efficient algorithm to search for the optimal set of attributes: for a dataset with \(n\) attributes, there are \(2^{n}\) possible subsets to consider, so an exhaustive search becomes time-consuming and computationally expensive when \(n\) is large.

Feature selection has been applied to enhance classification performance and to reduce data noise [2, 17, 20, 23]. A classifier can be improved by eliminating redundant attributes and keeping only the relevant ones, which reduces the size of the dataset and simplifies data analysis. Several methods have been studied for feature selection, among them stepwise forward selection and stepwise backward elimination [11], best-first search [21] and stochastic local search [27, 28]. Genetic algorithms have also been applied to feature and attribute selection [2, 22, 27, 30, 31].

Memetic algorithms (MAs), first proposed by Moscato [25, 26], are regarded as a promising framework that combines evolutionary algorithms (EAs) with local search (LS) methods. The evolutionary algorithm ensures exploration while the local search method performs exploitation [33].

In this work, an efficient memetic algorithm (MA) combining a genetic algorithm (GA) and a stochastic local search (SLS) is proposed to select an optimal set of attributes, which is then passed to a support vector machine based classifier to classify the considered data. In the proposed MA, the GA is responsible for exploring the search space and detecting promising regions with good solutions, while the SLS performs an effective exploitation of the regions found by the GA.

More precisely, the proposed MA searches for the best attribute set by applying the principles of an evolutionary process. The support vector machine (SVM) then classifies new data in the reduced datasets corresponding to the attribute subsets represented by the MA chromosomes. We note that the SVM-based classifier finds an optimal separating hyperplane: it uses a linear hyperplane to create a classifier with a maximum margin [20]. The algorithm finds the support vectors and their corresponding coefficients to construct an optimal separating surface, using kernel functions in a high-dimensional feature space [34].

In this study, we develop two versions of our hybrid method that combines MA and SVM. In the first version (\(MA+SVM^{default}\)), we use default parameters for the SVM, while in the second (denoted \(MA+SVM^{optimized}\)), we use optimized parameters. We note that, in addition to feature selection, \(MA+SVM^{optimized}\) optimizes the penalty parameter \(C\) and the gamma parameter \(g\) of the radial basis function (RBF) kernel.

The performance of the proposed MA combined with SVM is evaluated on several benchmark datasets of various sizes available in the UCI Machine Learning Repository. The two methods \(MA+SVM^{default}\) and \(MA+SVM^{optimized}\) are also compared against each other.

The rest of this paper is organized as follows. Section 2 presents an overview of support vector machines (SVM) and memetic algorithms (MA). Section 3 presents the proposed MA + SVM approach for feature selection and supervised classification. The results and discussions are reported in Sect. 4. Finally, we conclude this study in Sect. 5.

2 Background

In this section, we give an overview of the support vector machine (SVM) and the memetic algorithm (MA).

2.1 Support vector machine

Machine learning is a well-known technique used in classification. Machine learning can be divided into two main categories: supervised learning and unsupervised learning.

  • Supervised learning is a machine learning technique that builds a predictive model from a database in which the correct class variable is known. The class variable is the variable of the database that the model has to predict.

  • Unsupervised learning is a machine learning technique that uses a database without a class variable.

There are many different kinds of machine learning techniques such as: neural networks [18], decision trees [6], Bayesian networks [14] and support vector machines (SVM) [35].

In this work, we are interested in the support vector machine (SVM), a learning machine proposed by Vapnik et al. [34, 35]. SVMs learn from a training dataset and attempt to generalize and make correct predictions on novel data. For the training data we have a set of input vectors, denoted \(x_{i}\), each with a number of component features. These input vectors are paired with corresponding labels, denoted \(y_{i}\), and there are \(l\) such pairs \((i=1,\ldots ,l)\).

Let us consider a binary classification task with data points \(x_{i}(i=1,\ldots ,l)\), having corresponding labels \(y_{i} =\pm 1\).

The data are linearly separable when a straight line (more generally, a hyperplane) can separate the data points into two distinct classes, with \(y=+1\) in class 1 and \(y=-1\) in class 2. In that case many separating hyperplanes exist; the best one is the hyperplane with the maximal margin, since a maximum-margin hyperplane classifies future data tuples more accurately than one with a smaller margin.

The separating hyperplane can be written as in (1).

$$\begin{aligned} w \bullet x + b = 0 \end{aligned}$$
(1)

where \(\bullet \) denotes the inner or scalar product (so \(w\bullet x = w^{T} x\)), \(b\) is the bias or offset of the hyperplane from the origin in input space, \(x\) denotes the points lying on the hyperplane, and the normal to the hyperplane, the weight vector \(w\), determines its orientation.

Let the decision function be (2):

$$\begin{aligned} f(x)= sign(w \bullet x +b) \end{aligned}$$
(2)

From the decision function we see that the data is correctly classified if \(y_{i}(w \bullet x_{i} + b) > 0\) \(\forall i\) since \((w \bullet x_{i}+b)\) should be positive when \(y_{i}=+1\), and it should be negative when \(y_{i} = -1\).

The maximal margin is given by formula (3).

$$\begin{aligned} M = 2 / \Vert w \Vert \end{aligned}$$
(3)

where, \( \Vert w \Vert \) is the Euclidean norm of \(w\).

Maximizing the margin is therefore equivalent to minimizing:

$$\begin{aligned} \left\{ \begin{array}{ll} \min \frac{1}{2} w \bullet w \\ subject\,\, to\,\,the\,\,constraints\\ y_{i}(w \bullet x_{i}+ b)\ge 1 \quad \forall i \end{array} \right. \end{aligned}$$
(4)

In the following, we will make use of Lagrange multipliers, the Karush-Kuhn-Tucker (KKT) conditions and duality, and some familiarity with these subjects is assumed. The above formulation can be reduced to the minimization of the following Lagrange function, which consists of the objective function minus the \(l\) constraints, each multiplied by its respective Lagrange multiplier [7, 16]. We call this the primal formulation:

$$\begin{aligned} \min _{w,b} L(w,b,\alpha ) = \min _{w,b}\left( \frac{1}{2}\,w\bullet w - \sum _{i=1}^{l}\alpha _{i}\left( y_{i}(w \bullet x_{i}+ b)-1\right) \right) \end{aligned}$$
(5)

where \(\alpha _{i}\) are Lagrange multipliers, and thus \(\alpha _{i} \ge 0\). Setting the partial derivatives of \(L\) with respect to \(w\) and \(b\) to zero gives \(w=\sum _{i=1}^{l}\alpha _{i}y_{i}x_{i}\) and \(\sum _{i=1}^{l}\alpha _{i}y_{i}=0\); substituting these back into (5) yields the dual formulation:

$$\begin{aligned} \max _{\alpha } \varPhi (\alpha )=\max _{\alpha }\left( \sum _{i=1}^{l}\alpha _{i} -\frac{1}{2}\sum _{i=1}^{l}\sum _{j=1}^{l}y_{i} y_{j} \alpha _{i}\alpha _{j}\, x_{i}\bullet x_{j}\right) \end{aligned}$$
(6)

subject to the constraints:

$$\begin{aligned} \left\{ \begin{array}{rl} \sum \nolimits _{i=1}^{l}\alpha _{i}y_{i}=0\\ \alpha _{i}\ge 0 \end{array} \right. \end{aligned}$$
(7)

Now we suppose that we are presented with a two-class dataset, and we do not know if the data is linearly separable. We can start with a linear kernel:

$$\begin{aligned} k\left( x_{i},x_{j}\right) = x_{i}\bullet x_{j} \end{aligned}$$
(8)

with no mapping to feature space. If the data are not linearly separable, we cannot achieve zero training error with this linear kernel. Though the data may not be separable in input space, they may become separable in a higher-dimensional feature space if we use a Gaussian kernel (radial basis function, RBF) instead [16]:

$$\begin{aligned} k(x_{i},x_{j}) = e^{\left( -\frac{\Vert x_{i}-x_{j}\Vert ^{2}}{2\gamma ^{2}}\right) } \end{aligned}$$
(9)

with \(\gamma > 0\)

The introduction of a kernel with its implied mapping to feature space is known as kernel substitution. Many other choices for a kernel function are possible [16].
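As an illustration, the two kernels of Eqs. (8) and (9) can be coded directly in C; the sketch below is only illustrative and represents feature vectors as plain arrays of doubles. Note that LIBSVM, used later in the experiments, parameterizes the RBF kernel as \(e^{-g\Vert x_{i}-x_{j}\Vert ^{2}}\), so its \(g\) plays a slightly different role than \(\gamma\) in Eq. (9).

```c
/* Illustrative sketch of the two kernels of Eqs. (8) and (9).
   Feature vectors are represented as plain arrays of doubles. */
#include <math.h>
#include <stdio.h>

/* Linear kernel: k(xi, xj) = xi . xj  (Eq. 8) */
double kernel_linear(const double *xi, const double *xj, int n) {
    double dot = 0.0;
    for (int k = 0; k < n; k++)
        dot += xi[k] * xj[k];
    return dot;
}

/* Gaussian (RBF) kernel: k(xi, xj) = exp(-||xi - xj||^2 / (2 gamma^2))  (Eq. 9) */
double kernel_rbf(const double *xi, const double *xj, int n, double gamma) {
    double dist2 = 0.0;
    for (int k = 0; k < n; k++) {
        double d = xi[k] - xj[k];
        dist2 += d * d;
    }
    return exp(-dist2 / (2.0 * gamma * gamma));
}

int main(void) {
    double a[] = {1.0, 0.0, -1.0}, b[] = {0.5, 0.5, -0.5};
    printf("linear = %f, rbf = %f\n",
           kernel_linear(a, b, 3), kernel_rbf(a, b, 3, 1.0));
    return 0;
}
```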

For binary classification with a given choice of kernel, the learning task therefore involves maximization of:

$$\begin{aligned} \max _{\alpha } \varPhi (\alpha )=\max _{\alpha }\left( \sum _{i=1}^{l}\alpha _{i} -\frac{1}{2}\sum _{i=1}^{l}\sum _{j=1}^{l}y_{i} y_{j} \alpha _{i}\alpha _{j}\, k(x_{i},x_{j})\right) \end{aligned}$$
(10)

subject to the constraints (7).

The solution yields optimal values of the Lagrange multipliers, which we denote \(\alpha _{i}^{*}\). A training point \((x_{i}, y_{i})\) with a nonzero multiplier \(\alpha _{i}^{*}> 0\) lies on one of the two supporting hyperplanes, depending on its label \(y_{i}\). Such a point constrains the margin, in the sense that the supporting hyperplanes cannot be moved beyond it; points with nonzero Lagrange multipliers are called support vectors. All other points have \(\alpha _{i}^{*}=0\), and the decision function is independent of these samples. If we pick a support vector from the training set, \(\left( x_{sv}^{+},y_{sv}^{+}\right) \) with \(y_{sv}^{+} = +1\), we can calculate the bias \(b\) from (11).

$$\begin{aligned} b^{*}=1-\sum _{i=1}^{l} \alpha _{i}^{*} y_{i} k\left( x_{i},x_{sv}^{+}\right) \end{aligned}$$
(11)

For a novel input vector \(z\), the predicted class is then given by the decision function:

$$\begin{aligned} f(z)=sgn\left( \sum _{i=1}^{l}\alpha _{i}^{*}y_{i} k(x_{i},z)+b^{*}\right) \end{aligned}$$
(12)
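Assuming the optimal multipliers \(\alpha _{i}^{*}\) have already been computed by a QP solver (such as the one inside LIBSVM), Eqs. (11) and (12) translate into a few lines of C. The sketch below reuses the kernel_rbf function from the previous listing; the function names are illustrative and not part of any library API.

```c
/* Illustrative sketch of Eqs. (11) and (12), assuming the optimal
   multipliers alpha*_i were obtained beforehand by a QP solver.
   kernel_rbf() is the RBF kernel sketched above (Eq. 9). */
double kernel_rbf(const double *xi, const double *xj, int n, double gamma);

/* Bias b* computed from one positive support vector x_sv_plus (Eq. 11). */
double svm_bias(double **X, const double *y, const double *alpha,
                int l, int n, double gamma, const double *x_sv_plus) {
    double s = 0.0;
    for (int i = 0; i < l; i++)
        s += alpha[i] * y[i] * kernel_rbf(X[i], x_sv_plus, n, gamma);
    return 1.0 - s;
}

/* Predicted class of a novel input z: sign of the kernel expansion (Eq. 12). */
int svm_predict_class(double **X, const double *y, const double *alpha,
                      int l, int n, double gamma, double b, const double *z) {
    double f = b;
    for (int i = 0; i < l; i++)
        if (alpha[i] > 0.0)              /* only support vectors contribute */
            f += alpha[i] * y[i] * kernel_rbf(X[i], z, n, gamma);
    return f >= 0.0 ? +1 : -1;
}
```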

2.2 The memetic algorithm

Memetic algorithms (MAs) [25, 26] can be viewed as a union of population-based global search and local search, the latter seen as a form of cultural evolution. MAs are inspired by Darwinian principles of natural evolution and by Dawkins' notion of a meme, defined as a unit of cultural evolution that performs local refinements. MAs are expected to balance exploration and exploitation of the search space so as to combine the advantages of population-based methods and local search methods. MAs are effective methods that have demonstrated high performance and robustness across several problem domains [1, 3, 4, 10, 12, 15, 32, 37]. However, the hybridization can be more complex and expensive to implement. The main components that play an important role in MAs are:

  • The initial population: as in any optimization problem, a good starting point determines the speed of convergence to the optimum. In MAs, the initial population is generated randomly, as in genetic algorithms.

  • The selection: the selection operator statistically identifies the best individuals of the population and eliminates the worst ones.

  • The crossover operator: in an iterative process, at each step two individuals (parents) of the population are selected and recombined to obtain a new individual (child).

  • The local search operator: it is applied to the new individual to improve its quality.

  • The insertion condition: it determines whether the new individual (the resulting child) is added to the population, where it replaces an existing individual.

  • The stop condition: in general, the search process is stopped after a certain number of generations or when a time limit is reached.

3 The proposed approach

In this work, we propose a memetic algorithm combined with a support vector machine for supervised classification. The proposed memetic algorithm is a hybrid of the genetic algorithm and the stochastic local search. The goal of the proposed MA + SVM method is to optimize the accuracy of the SVM classifier by detecting the subset of best features.

The MA + SVM method starts with a population of solutions generated randomly. Each solution represents a coded string of binary digits which denote the attributes present in the dataset. The length of the string is equal to \(n\), where \(n\) is the number of attributes.

A solution can be defined as a subset of selected attributes. To represent the selected attribute set, the following assignment is used: if a particular attribute is selected in a solution, the value 1 is assigned to it; conversely, if an attribute is not included in the selected subset, the value 0 is assigned to it. For example, if the first, second, third and fourth attributes are selected in a dataset of six attributes, the solution is represented as '111100'. After the initialization of the population, the algorithm iteratively generates offspring using the GA, and each offspring undergoes local refinement with the SLS. The selected set of attributes is sent to the support vector machine based classifier, which classifies the considered data.
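As a hedged illustration of this encoding, a chromosome can be stored as a simple byte array in C; the Solution structure and the function names below are illustrative only, not part of the actual implementation.

```c
/* Illustrative sketch of the chromosome encoding: a binary string of
   length n where bits[i] = 1 means that attribute i is selected. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int n;           /* total number of attributes in the dataset   */
    char *bits;      /* bits[i] = 1 if attribute i is selected      */
    double fitness;  /* classification accuracy returned by the SVM */
} Solution;

/* Random solution, as used to build the initial population. */
void solution_random(Solution *s, int n) {
    s->n = n;
    s->bits = malloc(n);
    for (int i = 0; i < n; i++)
        s->bits[i] = rand() % 2;
    s->fitness = 0.0;
}

int main(void) {
    Solution s;
    solution_random(&s, 6);
    for (int i = 0; i < s.n; i++)
        putchar(s.bits[i] ? '1' : '0');   /* e.g. "111100" selects attributes 1-4 */
    putchar('\n');
    free(s.bits);
    return 0;
}
```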

The proposed approach is depicted in Fig. 1. In the proposed memetic algorithm, which combines a genetic algorithm (GA) and a stochastic local search (SLS), the GA explores the global search space of features while the SLS plays the role of local exploitation.

Fig. 1 Flowchart of the proposed approach

3.1 The fitness function

The fitness function is defined as the classification accuracy. We used 10-fold cross-validation, which is the standard way of measuring the accuracy of a learning scheme on a particular dataset [17]. The data are divided randomly into ten parts in which each class is represented in approximately the same proportions as in the full dataset. During each run, one of the partitions is held out for testing and the remaining nine-tenths are used for training. The procedure is repeated ten times so that each partition is used for testing exactly once. The classifier performance \(Accuracy\) is evaluated by calculating the ratio of the number of correctly classified instances to the total number of instances, using formula (13)

$$\begin{aligned} \left\{ \begin{array}{ll} Fitness=Accuracy\\ Accuracy = \frac{tp +tn}{tp+fn+fp+tn} \end{array} \right. \end{aligned}$$
(13)

where \(tp\) is the number of true positives, \(tn\) the number of true negatives, \(fp\) the number of false positives and \(fn\) the number of false negatives. The average rate of correct classification over all test folds corresponds to the prediction rate (average classification accuracy).
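As an illustrative sketch of how this fitness can be computed with the LIBSVM C API, svm_cross_validation() can be used to obtain the cross-validated predictions, from which the accuracy of Eq. (13) follows; building the svm_problem from the reduced dataset (only the attributes selected by the chromosome) is assumed to be done elsewhere.

```c
/* Hedged sketch: computing the fitness of Eq. (13) with the LIBSVM C API.
   svm_cross_validation() fills `target` with the label predicted for each
   instance when it lies in the held-out fold; building the svm_problem
   from the reduced dataset is assumed to be done elsewhere. */
#include <stdlib.h>
#include "svm.h"   /* LIBSVM header */

double fitness_accuracy(const struct svm_problem *prob,
                        const struct svm_parameter *param) {
    double *target = malloc(prob->l * sizeof(double));
    int correct = 0;

    /* 10-fold cross-validation, as used for the fitness in this work */
    svm_cross_validation(prob, param, 10, target);

    /* Accuracy = (tp + tn) / (tp + fn + fp + tn): fraction of instances
       whose predicted label matches the true label                     */
    for (int i = 0; i < prob->l; i++)
        if (target[i] == prob->y[i])
            correct++;

    free(target);
    return (double)correct / prob->l;
}
```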

3.2 The memetic algorithm for feature selection

In the proposed MA + SVM method, GA is used to explore the search space while SLS is used as a refinement step. The selected features are sent to the SVM. The SVM classifier is used to classify the considered data. The fitness value measures the quality of an individual; the fitness values correspond to the classification accuracies empirically derived from the SVM classifier on the test data.

The main components that play an important role in our MA + SVM are:

  • The initial population: this population is generated randomly like in genetic algorithms.

  • The selection operator: this is based on fitness; the best individuals of the population are selected and the worst ones are eliminated.

  • The crossover operator: single-point crossover is applied in our case. The two selected parent chromosomes are cut at a randomly chosen point and their adjacent substrings are interchanged, giving two new individuals (two children). In order to select the appropriate child for refinement in the next step, we studied three strategies (a sketch of this operator with the first strategy is given after this list).

    1. The first strategy (S1): we choose the best child according to the fitness value.

    2. The second strategy (S2): we choose the worst child according to the fitness value.

    3. The third strategy (S3): we choose one child at random, with a fixed probability (\(pl\) = 0.5).

    The three strategies are implemented and evaluated on some datasets in order to select the best one (see Sect. 4).

  • The stochastic local search method: it is applied to the new individual to improve its quality.

  • The insertion condition: the resulting child is added to the population and the worst individual is discarded.

  • The support vector machine (SVM) classifier: it is used to classify the considered data. We evaluated two versions of the SVM:

    1. \(SVM^{default}\): a support vector machine with default parameters.

    2. \(SVM^{optimized}\): a support vector machine whose parameters are optimized using an iterative method.

  • The stop condition: the search process is stopped after a certain number of generations.
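The following hedged C sketch illustrates the single-point crossover together with the S1 child-selection strategy; evaluate() stands for the SVM-based fitness of Eq. (13) and is an assumed helper, not part of LIBSVM.

```c
/* Illustrative sketch of the single-point crossover with the S1 strategy.
   evaluate() stands for the SVM-based fitness of Eq. (13) (assumed helper). */
#include <stdlib.h>
#include <string.h>

double evaluate(const char *bits, int n);   /* assumed: accuracy on the reduced dataset */

/* Cut the two parents (length n >= 2) at a random point, swap their tails,
   and return through `child` the better of the two children (strategy S1). */
void crossover_S1(const char *p1, const char *p2, char *child, int n) {
    char *c1 = malloc(n), *c2 = malloc(n);
    int cut = 1 + rand() % (n - 1);              /* cut point in [1, n-1] */

    memcpy(c1, p1, cut);  memcpy(c1 + cut, p2 + cut, n - cut);
    memcpy(c2, p2, cut);  memcpy(c2 + cut, p1 + cut, n - cut);

    /* S1: keep the child with the best fitness for the SLS refinement step */
    if (evaluate(c1, n) >= evaluate(c2, n))
        memcpy(child, c1, n);
    else
        memcpy(child, c2, n);

    free(c1);
    free(c2);
}
```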

3.3 Refinement with stochastic local search

In the proposed method, the stochastic local search (SLS) is employed to conduct exploitation of the feature selection solution space. SLS is a local search meta-heuristic which has already been studied for several optimization problems, such as satisfiability and the optimal winner determination problem (WDP) in combinatorial auctions [5, 27, 28].

The stochastic local search works on the best individual created by the genetic algorithm process. Once an individual is selected to be refined, and in order to avoid exploiting the same region repeatedly, SLS performs a certain number of local steps that combine diversification and intensification strategies to locate a good solution:

  1. Step 1: the diversification phase, which consists in selecting a random neighbor individual.

  2. Step 2: the intensification phase, which consists in selecting the neighbor individual having the best fitness.

The diversification phase is applied with a fixed probability \(wp>0\) and the intensification phase with probability \((1-wp)\). The probability \(wp\) is fixed empirically.

Neighborhood solutions are generated by randomly adding or deleting a feature from the feature vector of size \(n\). For example, if \(11001\) is the current feature vector, then the possible neighbors are: \(01001\), \(10001\), \(11101\), \(11011\), \(11000\). Among the neighbors, the one with the best fitness (i.e. the solution which results in the maximum value of Eq. 13) is selected and considered as the new current solution for the next iteration. The process is repeated until a certain number of iterations, called \(maxiter\), is reached. The SLS method is sketched in Algorithm 1.

Algorithm 1: The stochastic local search (SLS) method
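Since Algorithm 1 is only shown as a figure, the following hedged C sketch illustrates the SLS refinement step as described above: with probability \(wp\) a random feature is flipped (diversification), otherwise the best single-flip neighbor is taken (intensification). The evaluate() helper is an assumption, as before.

```c
/* Hedged C sketch of the SLS refinement: with probability wp a random
   feature bit is flipped (diversification); otherwise the best single-flip
   neighbor is taken (intensification). evaluate() is the assumed
   SVM-based fitness helper of Eq. (13). */
#include <stdlib.h>
#include <string.h>

double evaluate(const char *bits, int n);

void sls_refine(char *sol, int n, int maxiter, double wp) {
    char *cand = malloc(n);
    for (int iter = 0; iter < maxiter; iter++) {
        if ((double)rand() / RAND_MAX < wp) {
            /* diversification: flip one randomly chosen feature */
            sol[rand() % n] ^= 1;
        } else {
            /* intensification: examine the n single-flip neighbors
               and move to the one with the best fitness            */
            double best = -1.0;
            int best_bit = 0;
            for (int i = 0; i < n; i++) {
                memcpy(cand, sol, n);
                cand[i] ^= 1;
                double f = evaluate(cand, n);
                if (f > best) { best = f; best_bit = i; }
            }
            sol[best_bit] ^= 1;
        }
    }
    free(cand);
}
```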

3.4 The overall algorithm MA + SVM for features selection

The overall algorithm MA + SVM for classification is sketched in Algorithm 2.

Algorithm 2: The overall MA + SVM algorithm
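As Algorithm 2 is likewise shown only as a figure, the hedged C skeleton below summarizes one possible reading of the overall loop, reusing the helpers sketched earlier; details such as the selection scheme are illustrative assumptions rather than the exact implementation.

```c
/* Hedged skeleton of the overall MA + SVM loop, reusing the helpers
   sketched earlier. The selection scheme (the two fittest individuals as
   parents) and the insertion rule (the child replaces the worst individual)
   follow the description in Sect. 3.2; all names are illustrative. */
#include <stdlib.h>
#include <string.h>

double evaluate(const char *bits, int n);                        /* fitness, Eq. (13)    */
void   crossover_S1(const char *p1, const char *p2, char *child, int n);
void   sls_refine(char *sol, int n, int maxiter, double wp);

void ma_svm(char **pop, double *fit, int pop_size, int n,
            int generations, int maxiter, double wp) {
    char *child = malloc(n);
    for (int g = 0; g < generations; g++) {
        /* selection: take the two fittest individuals as parents */
        int p1 = 0, p2 = 1;
        for (int i = 1; i < pop_size; i++) {
            if (fit[i] > fit[p1]) { p2 = p1; p1 = i; }
            else if (fit[i] > fit[p2]) { p2 = i; }
        }

        crossover_S1(pop[p1], pop[p2], child, n);   /* recombination (S1)   */
        sls_refine(child, n, maxiter, wp);          /* SLS local refinement */

        /* insertion: the child replaces the worst individual of the population */
        int w = 0;
        for (int i = 1; i < pop_size; i++)
            if (fit[i] < fit[w]) w = i;
        memcpy(pop[w], child, n);
        fit[w] = evaluate(child, n);
    }
    free(child);
}
```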

4 The experiments

In this section, we present the experimental studies conducted to evaluate the performance of the proposed method for feature selection in data classification. We used the LIBSVM package as a library for support vector machines [8]. The programs are coded in the C language and run on a computer with an Intel Core(TM) i5-3210M 2.50 GHz CPU and 6.00 GB of RAM.

4.1 The diversity measure

To measure the diversity of the population in evolutionary methods, several measures have been proposed, among them the entropy, the Hamming distance, and the moment of inertia. In [24], Morrison and De Jong showed that the moment of inertia and the Hamming distance give similar values; however, the Hamming distance is more time-consuming to compute than the moment of inertia. In this paper, we use the moment of inertia as the diversity measure.

When applying the moment of inertia calculation to binary strings, each bit is treated as an independent 'spatial' dimension. Under these circumstances, the coordinates of the centroid \((C_{1}, C_{2}, C_{3}, \ldots , C_{l})\) of a population of \(P\) bit strings of length \(l\) are computed as in formula (14):

$$\begin{aligned} C_{i} = \frac{1}{P}\sum _{j=1}^{P}x_{i}^{j} \end{aligned}$$
(14)

The moment of inertia (MOI) about the centroid is given in formula (15):

$$\begin{aligned} MOI = \sum _{i=1}^{l} \sum _{j=1}^{P}\left( x_{i}^{j}-C_{i}\right) ^2 \end{aligned}$$
(15)

Further details on this measure can be found in [24].
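A direct C implementation of Eqs. (14) and (15) for a population of \(P\) binary chromosomes of length \(l\) might look as follows (an illustrative sketch only):

```c
/* Illustrative sketch of Eqs. (14)-(15): moment of inertia of a population
   of P binary chromosomes of length l about its centroid. */
double moment_of_inertia(char **pop, int P, int l) {
    double moi = 0.0;
    for (int i = 0; i < l; i++) {
        /* centroid coordinate C_i over the population (Eq. 14) */
        double c = 0.0;
        for (int j = 0; j < P; j++)
            c += pop[j][i];
        c /= P;
        /* accumulate the squared deviations about the centroid (Eq. 15) */
        for (int j = 0; j < P; j++) {
            double d = pop[j][i] - c;
            moi += d * d;
        }
    }
    return moi;
}
```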

4.2 The dataset description

The present investigation analyses 22 unbalanced datasets available in UCI Machine Learning Repository [9]. The dataset description is given in Table 1.

Table 1 The dataset description

4.3 The dataset normalization

Before applying the classification method, the dataset must be normalized. The main advantages of this operation are to prevent attributes with larger numeric ranges from dominating those with smaller ranges, and to avoid numerical difficulties during computation. Experimental studies show that scaling the feature values improves SVM accuracy. The range of each feature value is generally linearly scaled to \([-1,+1]\) or \([0,1]\) using formula (16), where \(X\) denotes the original value, \(X^{'}\) denotes the scaled value, \(MAX_{a}\) is the upper bound of feature \(a\), and \(MIN_{a}\) is the lower bound of feature \(a\). The current study scales feature values to the range \([-1, +1]\).

$$\begin{aligned} X ^{'} = \frac{X -MIN_{a}}{MAX_{a} -MIN_{a}} \times 2 -1 \end{aligned}$$
(16)
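For illustration, formula (16) amounts to the following per-attribute scaling routine in C (a sketch; the function name is illustrative):

```c
/* Illustrative sketch of Eq. (16): linear scaling of one attribute column
   from [MIN_a, MAX_a] to [-1, +1]. Constant attributes (MAX_a == MIN_a)
   would need to be handled separately. */
void scale_attribute(double *column, int n_instances, double min_a, double max_a) {
    for (int i = 0; i < n_instances; i++)
        column[i] = (column[i] - min_a) / (max_a - min_a) * 2.0 - 1.0;
}
```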

4.4 The impact of the child selection strategy on the performance of MA + SVM

As already said, applying the crossover operator yields two individuals. One of them should be selected and sent to the next step for refinement by the stochastic local search method. The aim of this section is to show how we select the appropriate child to be refined. We implemented three versions of MA + SVM with different child selection strategies:

  1. MA + SVM (S1): uses strategy (S1), which selects the best child.

  2. MA + SVM (S2): uses strategy (S2), which selects the worst child.

  3. MA + SVM (S3): uses strategy (S3), which selects a random child with a fixed probability (\(pl\) = 0.5).

We evaluated the performance of the three strategies on the diabetes dataset. We used the MOI diversity measure and the accuracy to show the behavior of our method with the three child selection strategies.

Figure 2 compares the three strategies in terms of diversity of individuals in the population, while Fig. 3 compares them in terms of fitness.

Fig. 2 Comparison between MA + SVM with the three strategies and GA + SVM on the diabetes dataset in terms of diversity

As shown in Fig. 2, the three strategies consistently maintain a good level of diversity as the evolution progresses. However, the first strategy S1 seems more interesting than both the S2 and S3 strategies.

We also compared our method (MA + SVM with the S1 strategy) with GA + SVM in terms of diversity. As shown by the curves in Fig. 2, MA + SVM with each of the three strategies maintains more diversity than the simple genetic algorithm. This is due to the stochastic local search, which ensures some diversity in the search space when combined with the GA.

The same behavior can be seen when comparing these strategies in terms of solution quality (Fig. 3). This comparison shows a good performance in favor of MA + SVM with S1. Overall, when the S1 strategy is used, the diversity of the population is maintained and there is no loss in quality. For the rest of our experiments, we adopt this strategy (S1) to select the appropriate child for the refinement step.

Fig. 3 Comparison between MA + SVM with the three strategies and GA + SVM on the diabetes dataset in terms of fitness

4.5 Parameter settings

The parameters of the proposed method were fixed empirically after a series of experiments. The values of each parameter of the proposed method are given in Table 2.

Table 2 Parameters of MA + SVM method

4.6 Numerical results

In this section, we present and discuss the numerical results obtained by the proposed method. We implemented two versions of our method: \(MA+SVM^{default}\) and \(MA+SVM^{optimized}\).

\(MA+SVM^{default}\) uses a support vector machine with default parameters, while \(MA+SVM^{optimized}\) uses a support vector machine with optimized parameters. In order to optimize the parameters of the SVM, we use an iterative method. The selected parameters are placed in a vector, which is evolved to find the best combination of parameters: the method iteratively modifies a random value in the vector and evaluates the performance of the SVM, and the new vector is kept when it achieves better performance. The best values found are the optimized values for the SVM. The two parameters considered in this study are the penalty parameter \(C\) and the gamma parameter \(g\) of the radial basis function (RBF) kernel.
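A hedged sketch of this iterative parameter search is given below; cv_accuracy() stands for the 10-fold cross-validation accuracy of the RBF-SVM on the reduced dataset (an assumed helper), and the perturbation scheme and iteration count are illustrative assumptions, not the exact procedure used in the experiments.

```c
/* Hedged sketch of the iterative SVM parameter optimization: the vector
   (C, g) is perturbed one coordinate at a time and a new value is kept only
   when the cross-validation accuracy improves. cv_accuracy() stands for the
   10-fold accuracy of the RBF-SVM on the reduced dataset (assumed helper);
   the perturbation scheme and iteration count are illustrative. */
#include <stdlib.h>

double cv_accuracy(double C, double g);   /* assumed: Eq. (13) with an RBF-SVM */

void optimize_svm_params(double *C, double *g, int iterations) {
    double best = cv_accuracy(*C, *g);
    for (int it = 0; it < iterations; it++) {
        double newC = *C, newG = *g;
        if (rand() % 2)
            newC *= 0.5 + (double)rand() / RAND_MAX;   /* perturb the penalty C   */
        else
            newG *= 0.5 + (double)rand() / RAND_MAX;   /* perturb the RBF gamma g */
        double acc = cv_accuracy(newC, newG);
        if (acc > best) {                              /* keep only improvements  */
            best = acc;
            *C = newC;
            *g = newG;
        }
    }
}
```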

To show the effectiveness of the proposed method, a comparison is done with the pure SVM run alone, and with the hybrids SLS + SVM [27, 28] and GA + SVM [27] proposed recently by Nekkaa and Boughaci.

Due to the non-deterministic nature of the two versions of MA + SVM, SLS + SVM and GA + SVM, 30 runs were performed for each dataset and each algorithm. The average, worst and best values of the classification accuracy rate found by each method are reported in the paper. The best results are in bold font.

4.6.1 Comparison with a pure SVM

The proposed method was compared with the pure SVM to investigate the benefit of the additional attribute selection component. A comparison is also done with \(MA+SVM^{optimized}\) to show the effect of the optimized parameters. Table 3 gives the average classification accuracy rates obtained by the pure SVM, \(MA+SVM^{default}\) and \(MA+SVM^{optimized}\) on the considered datasets. We also give the optimized parameters (\(C\) and \(g\)) of the optimized SVM for each dataset.

Table 3 \(MA+SVM^{default}\) versus \(MA+SVM^{optimized}\) versus pure SVM

As we can see from Table 3, the two versions of the proposed MA + SVM approach perform better than the pure SVM in all examined cases. The average classification accuracy rate for each dataset is improved significantly after feature selection. This result reveals that good results are obtainable even when fewer features are included in the model, which means that some features are redundant or insignificant for particular classification problems. Clearly, the two versions of the MA + SVM approach can find a subset of features without decreasing the SVM classification accuracy. We can also see the good performance of \(MA+SVM^{optimized}\), which confirms that, in addition to feature selection, the optimization of the SVM parameters significantly improves the classification accuracy.

4.6.2 Comparison between \(MA+SVM^{default}\), \(MA+SVM^{optimized}\), SLS+SVM and GA+SVM

Table 4 gives the average (\(Mean\)), the best (\(Max\)), the worst (\(Min\)) values of the classification accuracy, and the standard deviation (\(SD\)) obtained by the two versions of MA + SVM, SLS + SVM [27, 28] and GA + SVM [27].

Table 4 \(MA+SVM^{optimized}\) versus \(MA+SVM^{default}\) versus SLS+SVM versus GA + SVM

As shown in Table 4, the two versions of MA + SVM succeed in finding good results on almost all of the tested datasets compared to the SLS + SVM and GA + SVM methods. However, \(MA+SVM^{optimized}\) is the most competitive: it succeeds in finding high quality solutions compared to the other tested methods. In terms of classification accuracy, \(MA+SVM^{optimized}\) significantly outperforms SLS + SVM, GA + SVM and \(MA+SVM^{default}\).

The standard deviations of the classification accuracies are also reported. The small standard deviations show the consistency of the proposed method \(MA+SVM^{optimized}\).

The superiority of \(MA+SVM^{optimized}\) is due to the combination of MA-based feature selection and parameter optimization for the SVM.

In Table 5, we compare the four methods (\(MA+SVM^{optimized}\), \(MA+SVM^{default}\), SLS + SVM, GA + SVM) in terms of the number of selected attributes. The results clearly show that the four algorithms obtain their best classification rates with a small set of attributes. Moreover, \(MA+SVM^{optimized}\) generally selects fewer features while maintaining a good classification rate.

Table 5 A comparative study according to the number of selected features

The average computational time required by \(MA+SVM^{optimized}\), \(MA+SVM^{default}\), GA + SVM and SLS + SVM is shown in Table 6. As shown in Table 6, the four methods are comparable in terms of computational time.

Table 6 A comparative study according to the average computational time in seconds of each method

In addition to the numerical analysis performed on the various performance measures, we use box plot diagrams to better visualize the distribution of the classification accuracy values. Such a diagram is an effective way to graphically present several groups of numerical data through their five-number summary: the smallest observation (sample minimum), the lower quartile \(Q_{1}\) (the bar at the bottom of the rectangle), the median \(Q_{2}\), the upper quartile \(Q_{3}\) (the bar at the top of the rectangle) and the largest observation (sample maximum); small circles represent outlier values.

The box diagrams of Figs. 4, 5, 6, 7, 8, 9, 10 and 11 visualize the distribution of the classification accuracy over the 30 runs for each algorithm and each test database. The box plots and the small standard deviation values show that, in general, \(MA+SVM^{optimized}\) produces consistent results. The results are very interesting and demonstrate the benefit of our approach in data classification.

Fig. 4 Box plot for the Echocardiogram data set

Fig. 5 Box plot for the Glass data set

Fig. 6 Box plot for the Heart-swiss data set

Fig. 7 Box plot for the Hepatitis data set

Fig. 8 Box plot for the Lung-cancer data set

Fig. 9 Box plot for the Sonar data set

Fig. 10 Box plot for the Vehicle data set

Fig. 11 Box plot for the Vowel data set

4.6.3 Further comparisons

In this section, we compare the \(MA+SVM^{optimized}\) method with some popular classifiers commonly used for data mining purposes: the decision tree algorithm C4.5 [29], the rule-learning scheme PART [13], Naive Bayes [19], ZeroR, JRip and the Attribute Selection Classification algorithm from WEKA. These classifiers were used with their default parameters as originally set in WEKA.

Table 7 compares \(MA+SVM^{optimized}\) with these well-known classifiers on the considered datasets.

Table 7 A comparative study according to the classification accuracy rates

From the numerical results, the proposed approach \(MA+SVM^{optimized}\) succeeds in finding good results on the tested datasets. \(MA+SVM^{optimized}\) performs better on almost all of the tested datasets, with high accuracy (100 \(\%\) for the Hepatitis and Dermatology datasets). We observe a slight advantage in favor of the Naive Bayes classifier on the Heart-c and Lymphography datasets, which confirms its ability as a good classifier in data mining.

5 Conclusion

Feature selection is an important issue in the construction of classification systems. SVM is a commonly adopted classification method; however, data without feature selection may be noisy and may degrade the classification accuracy rate. This study proposed a memetic algorithm based on GA and SLS that searches for a subset of beneficial features. This subset is then used on the benchmark datasets to obtain the best possible classification outcomes.

In the proposed MA + SVM, the GA was used to explore the search space and detect the potential regions with optimal solutions, while the SLS was used to conduct an effective refinement of the potential regions found by the GA. The SVM classifier is used to classify the considered data.

In addition, another hybrid of MA and SVM with optimized parameters was also developed, and the performance of the two methods \(MA+SVM^{default}\) and \(MA+SVM^{optimized}\) was compared.

The performance of the proposed method was confirmed through experiments on several benchmark datasets. The experimental results show the effectiveness of our approach, in particular \(MA+SVM^{optimized}\), in solving classification problems in data mining. When the MA + SVM based classifier combines both feature selection and parameter optimization, the search space is better explored and the method succeeds in finding high quality solutions. We plan to improve our framework by testing it on other very large scale classification problems. We also plan to implement a parallel method to reduce the running time.