
1 Introduction

With the development of information and computer technologies, big data mining has been widely employed in various industries, such as finance and stock markets, traffic, tourism, health records, social security, scientific data, and so forth. As is well known, big data poses three main challenges, namely velocity, variety, and volume, and it comprises various types of data [1]. These are the main challenges for big data mining. Traditional data mining methods therefore lead to extreme time consumption and complexity [2], and obviously cannot meet the processing and time requirements of big data.

In data mining algorithms, feature selection is a key issue. It selects the most important features to improve the performance of the classifier, which is employed to predict classes for new samples. Several studies have addressed this problem. Heuristics, as efficient algorithms, have been widely applied to feature selection [3], and most of them are top-down supervised learning methods. However, heuristics require the full dataset during training, which makes them unsuitable for dynamic stream-processing environments [4], and many feature selection algorithms are developed only for specific application areas. As a general rule, the convergence speed of traditional algorithms is still slow, and they may not converge to a global minimum [5]. Therefore, it is necessary to design a high-performance algorithm for feature selection.

2 Feature Selection Algorithm Based on Genetic Algorithm

The genetic algorithm is an efficient heuristic method that is widely employed to find global solutions within a solution space. Therefore, to solve the feature selection problem, an algorithm based on an adaptive genetic algorithm is proposed, and its genetic operators, namely selection, crossover, and mutation, are designed in the following sections.

2.1 Initial Population

The initial population is built over n features, \( F = (f_1, f_2, \ldots, f_n) \), and is encoded as bit strings of length G = Ln, where L is the number of bits representing each feature. To improve computational efficiency and accelerate convergence, we set L to 5.
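As an illustration, the following minimal Python sketch initializes such a bit-string population. The parameters pop_size and seed are hypothetical, since the paper does not specify a population size:

```python
import numpy as np

def init_population(n_features, L=5, pop_size=50, seed=0):
    """Randomly initialize a population of bit-string chromosomes.

    Each of the n features is encoded with L bits (L = 5 in the
    paper), so each chromosome has G = L * n bits in total.
    """
    rng = np.random.default_rng(seed)
    return rng.integers(0, 2, size=(pop_size, n_features * L))
```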

The fitness value of an individual is calculated according to the following formula.

$$ P = \frac{k\, r_{cf}}{\sqrt{k + k(k+1)\, r_{ff}}} $$
(1)

where k is a constant, and \( r_{cf} \) and \( r_{ff} \) denote the average feature-class correlation and the average feature-feature correlation, respectively. During the calculation, if P > ɛ, where ɛ is a threshold, the algorithm terminates.
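A direct transcription of (1) into Python might look as follows, treating k, r_cf, and r_ff as precomputed scalars per the definitions above:

```python
import math

def fitness(k, r_cf, r_ff):
    """Fitness of an individual, transcribing Eq. (1):
    P = k * r_cf / sqrt(k + k * (k + 1) * r_ff)."""
    return k * r_cf / math.sqrt(k + k * (k + 1) * r_ff)

def terminated(P, epsilon):
    """Termination test: stop when P exceeds the threshold."""
    return P > epsilon
```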

2.2 Selection

To avoid the premature convergence that sometimes occurs, adaptive selection is performed according to the following probability.

$$ Q = \alpha P + \frac{e - e^{k/k_{\max}}}{e + e^{k/k_{\max}}} $$
(2)

where Q is the adjusted fitness used to select individuals into the next generation; the selection operation is performed according to the probability Q.
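A sketch of (2) in Python follows, under the assumption that k in (2) is the current generation index and k_max the maximum number of generations (the paper reuses the symbol k from (1)):

```python
import math

def selection_probability(P, alpha, k, k_max):
    """Adaptive selection weight, transcribing Eq. (2):
    Q = alpha * P + (e - e^(k/k_max)) / (e + e^(k/k_max)).
    k is assumed to be the generation index, k_max the maximum
    number of generations."""
    t = math.exp(k / k_max)
    return alpha * P + (math.e - t) / (math.e + t)
```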

2.3 Crossover

The adaptive crossover probability can be computed as follows:

$$ P_{c} = \begin{cases} P_{c2}, & f_{avg} > f',\ P_{c2} \le 1 \\ P_{c1}(f_{\max} - f')/(f_{\max} - f_{avg}), & f_{avg} \le f',\ P_{c1} \le 1 \end{cases} $$
(3)

where \( f' \) is the larger of the two fitness values of the individuals undergoing crossover, and f is the fitness of the individual to be mutated (used in (4)). \( P_{c1}, P_{c2}, P_{m1}, P_{m2} \) are control parameters, and \( f_{\max} \) and \( f_{avg} \) are the maximum and average fitness of the previous generation, respectively.
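Equation (3) can be transcribed into Python as follows; the guard for the degenerate case f_max = f_avg is an added assumption to avoid division by zero:

```python
def crossover_probability(f_prime, f_max, f_avg, p_c1, p_c2):
    """Adaptive crossover probability, transcribing Eq. (3).

    f_prime is the larger fitness of the two parents; p_c1 and
    p_c2 are the control parameters (both <= 1).
    """
    if f_avg > f_prime:
        return p_c2
    if f_max == f_avg:
        # All fitnesses equal: Eq. (3) is 0/0, fall back to p_c1.
        return p_c1
    return p_c1 * (f_max - f_prime) / (f_max - f_avg)
```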

2.4 Mutation

The purpose of mutation is to introduce a slight perturbation that increases the diversity of trial individuals after crossover, preventing them from clustering and causing premature convergence of the solution. The probability of mutation is calculated as follows:

$$ P_{m} = \begin{cases} P_{m2}, & f_{avg} > f,\ P_{m2} \le 1 \\ P_{m1}(f_{\max} - f)/(f_{\max} - f_{avg}), & f_{avg} \le f,\ P_{m1} \le 1 \end{cases} $$
(4)
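Equation (4) mirrors (3) with f in place of \( f' \); a matching sketch, with the same added guard:

```python
def mutation_probability(f, f_max, f_avg, p_m1, p_m2):
    """Adaptive mutation probability, transcribing Eq. (4).

    f is the fitness of the individual to be mutated.
    """
    if f_avg > f:
        return p_m2
    if f_max == f_avg:
        return p_m1
    return p_m1 * (f_max - f) / (f_max - f_avg)
```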

The steps involved in the proposed algorithm are listed below (a schematic implementation follows the list):

(Step 1) Initialize the population. The crossover probability and the mutation probability are computed using (3) and (4), respectively.

(Step 2) The fitness of each individual is computed using (1), and the termination condition determines whether the algorithm returns.

(Step 3) Adaptive selection is performed according to (2), and the next generation is obtained.

(Step 4) The crossover probability is computed using (3). A random λ is generated for each pair of individuals; if \( P_c > \lambda \), the crossover operation starts.

(Step 5) The mutation probability is computed using (4), and a random η in the range [0, 1] is generated. When \( P_m > \eta \), mutation starts.

(Step 6) The next generation is obtained. If the termination condition is satisfied, the optimal feature subset is returned; otherwise, go to (Step 2).
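Putting Steps 1-6 together, a schematic Python main loop might read as follows. This is a sketch under stated assumptions, not the authors' exact implementation: it uses one bit per feature rather than L = 5 to keep the mask readable, reads \( r_{cf} \) and \( r_{ff} \) as average feature-class and feature-feature correlations, interprets k in (2) as the generation index, assumes single-point crossover and bit-flip mutation, and reuses the crossover_probability and mutation_probability helpers sketched above:

```python
import numpy as np

def adaptive_ga_feature_selection(X, y, n_generations=50, pop_size=30,
                                  alpha=0.5, epsilon=0.9,
                                  p_c1=0.9, p_c2=0.6,
                                  p_m1=0.1, p_m2=0.01, seed=0):
    """Schematic main loop for Steps 1-6 (one bit per feature).
    Assumes X has no constant columns (correlations are defined)."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n))              # Step 1

    def merit(mask):
        # Eq. (1), reading r_cf / r_ff as average correlations.
        idx = np.flatnonzero(mask)
        if idx.size == 0:
            return 0.0
        k = idx.size
        r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in idx])
        r_ff = (np.mean([abs(np.corrcoef(X[:, i], X[:, j])[0, 1])
                         for i in idx for j in idx if i < j])
                if k > 1 else 0.0)
        return k * r_cf / np.sqrt(k + k * (k + 1) * r_ff)

    best = pop[0].copy()
    for gen in range(1, n_generations + 1):
        fit = np.array([merit(ind) for ind in pop])           # Step 2
        best = pop[int(np.argmax(fit))].copy()
        if fit.max() > epsilon:                               # terminate
            break
        t = np.exp(gen / n_generations)                       # Step 3, Eq. (2)
        Q = np.clip(alpha * fit + (np.e - t) / (np.e + t), 1e-12, None)
        pop = pop[rng.choice(pop_size, size=pop_size, p=Q / Q.sum())]
        f_max, f_avg = fit.max(), fit.mean()
        for i in range(0, pop_size - 1, 2):                   # Step 4
            f_prime = max(merit(pop[i]), merit(pop[i + 1]))
            pc = crossover_probability(f_prime, f_max, f_avg, p_c1, p_c2)
            if pc > rng.random():
                cut = int(rng.integers(1, n))                 # single-point
                pop[i, cut:], pop[i + 1, cut:] = (pop[i + 1, cut:].copy(),
                                                  pop[i, cut:].copy())
        for i in range(pop_size):                             # Step 5
            pm = mutation_probability(merit(pop[i]), f_max, f_avg,
                                      p_m1, p_m2)
            if pm > rng.random():
                pop[i, rng.integers(n)] ^= 1                  # bit flip
    return np.flatnonzero(best)                               # Step 6
```

Given a dataset loaded as a NumPy feature matrix X and label vector y, adaptive_ga_feature_selection(X, y) would return the indices of the selected feature subset.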

3 Experimental Results

In this section, we evaluate the performance of the proposed feature selection algorithm through computer simulation in MATLAB. To validate its performance, we compare it against BIF and C-F, two representative feature selection algorithms [6]. The datasets are taken from the UCI machine learning repository [7] and are listed in Table 1.

Table 1. The datasets used in the experiments.

To ensure the fairness of the experiment, each algorithm selects the same number of features. The decision tree, a classical learning algorithm, is used in the test, and the experimental environment is Weka. The learning algorithm was run three times and the results were averaged. The results are shown in Table 2.

Table 2. Size of feature subset for BIF algorithm

The results shown in Tables 1 and 2 reveal that the proposed algorithm selects the smallest feature subset, because it can discard samples that have already been identified. Furthermore, the classification accuracy of each feature subset is shown below.

Experimental results for each algorithm are shown in Table 3. From the simulation results, we observe that the proposed algorithm achieves better classification accuracy than the competing algorithms while using fewer features.

Table 3. Classification accuracy of each feature subset

The classification accuracy of each algorithm is shown in Figs. 1 and 2. From the results, we conclude that the proposed algorithm has better accuracy than the other algorithms.

Fig. 1. The accuracy of classification for Anneal

Fig. 2. The accuracy of classification for Wine

4 Conclusions

A feature selection algorithm based on an adaptive genetic algorithm was proposed in this paper. Methods for computing the crossover probability and the mutation probability were designed following the adaptive genetic scheme, and the algorithm realizes adaptive selection and optimization of feature subsets. Experimental results show that the proposed algorithm achieves notable improvements in classification accuracy and reduces the total computing time compared with conventional schemes.