1 Introduction

Feature selection is a combinatorial optimization problem [1]. It aims to choose a subset of variables that can characterize the input data while reducing the impact of noise and other irrelevant variables, so that accurate predictions can be made [2].

In classification, data sets tend to include numerous features, many of which are irrelevant or redundant. Because of the large search space they create, irrelevant and redundant features contribute nothing useful to categorization; worse, they can weaken classification performance and increase computation time, a phenomenon known as “the curse of dimensionality” [3].

To deal with this issue, a variety of feature selection techniques have been introduced. The main aim of feature selection is to eliminate irrelevant and redundant characteristics and to choose relevant features from the large feature set. Generally, these methods fall into three categories: filter, wrapper, and embedded. Filter methods are independent of any particular learning algorithm: they reason over the data of the feature set to choose a discriminative subset of features without considering any interaction with the learning algorithm. Techniques of this kind include information gain [4], document frequency [5], term strength [6], Chi-square [7], and odds ratio [8]. They have been widely applied to reduce the computational complexity of feature selection, particularly for extremely high-dimensional feature spaces such as text. Wrapper methods include the learning algorithm as one part of the evaluation process. Typical wrapper methods include sequential floating selection [9], sequential forward floating selection (SFFS) [10], and sparse logistic regression based methods [11]. The third category is embedded methods. Embedded techniques [12,13,14] perform feature selection as part of the training phase, without separating the data into training and test sets.

Because meta-heuristic techniques can find good solutions quickly by exploring the search space globally with several search strategies, many researchers have recently applied them to the feature selection problem. For example, Huang et al. (2008) [15] put forward a PSO-SVM model that combines PSO with support vector machines (SVMs) to enhance classification accuracy while improving the selection of the input feature subset. Neshatian et al. (2009) [16] applied genetic programming to feature subset ranking in binary classification problems. Chen et al. (2010) [17] proposed a feature selection method that hybridizes ant colony optimization (ACO) with rough set theory and achieves higher accuracy. Xue et al. (2016) [18] presented a multi-objective PSO for feature selection.

The firefly algorithm (FA) is a meta-heuristic search and optimization method based on swarm intelligence. It mimics the flashing and communication behavior of fireflies. Because it is simple, population-based, shares information among individuals, and converges quickly, modified variants were first proposed and successfully explored in fields such as continuous optimization [19], multimodal optimization [20], and constrained optimization [21], and later in real-world problems such as non-convex economic dispatch [22], clustering [23], combinatorial optimization [24], and image compression [25]. To date, FA has been successfully applied to many challenging optimization problems as well as NP-hard problems (Yang 2008). In this paper, FA is applied to the feature selection problem. Nevertheless, FA has several shortcomings for this task. First, different initialization strategies perform differently on different problems. Second, if the value of gbest does not change for a defined number of iterations, the swarm can become trapped in a local optimum at an early stage and its diversity decreases; position mutation is then needed to enhance its search ability and diversity.

1.1 Goals

In this paper, we use two adjustments to prevent the swarm from stagnating in local optima and converging prematurely: (1) to begin with a population of promising solutions, we apply an opposition-based strategy for population initialization; (2) if the value of gbest does not change for a fixed number of iterations, we use the opposite position of the gbest firefly to replace the least fit individual.

1.2 Organisation

The rest of this paper is organized as follows. Related work is reviewed in Sect. 2. The proposed methodology is defined in detail in Sect. 3. Section 4 presents the experimental results and a comparison of different models. Finally, conclusions are drawn in Sect. 5.

2 Related Work

2.1 Firefly Algorithm

FA is an important algorithm formulated by Yang (2008) and motivated by the social behavior of fireflies. The primary feature of these insects is their remarkable flashing lights, which are produced by a bioluminescence process; the flashing pattern is unique to each of the roughly 2000 known firefly species. The two main purposes of the flashes are to attract mating partners and to attract potential prey.

To simplify FA, the following three idealized rules are assumed: (1) all fireflies are of one sex, so any firefly can be attracted to any other; (2) attractiveness is proportional to brightness, meaning that the less bright firefly moves toward the brighter one, and if no firefly is brighter, it moves randomly; (3) the brightness of a firefly is determined by the landscape of the objective function. The pseudocode is given in Algorithm 1.

The formulations of the light intensity and the attractiveness are the two key ingredients of FA.

Light intensity can be formulated as follows:

$$I = I_{0} e^{{ - \gamma r_{ij}^{2} }}$$
(1)

where \(I_{0}\) is the light intensity at the source (\(r = 0\)).

The attractiveness of a firefly results from the light intensity. The attractiveness can be approached as follows:

$$\upbeta = \beta_{0} e^{{ - \gamma r_{ij}^{2} }}$$
(2)

where \(\beta_{0}\) is the attractiveness at \(r = 0\) and γ is the light absorption coefficient, which is fixed at 1.0 in FA.

The distance between any two fireflies i and j, located at \(x_{i}\) and \(x_{j}\), is the Cartesian distance:

$$r_{ij} = \left\| {x_{i} - x_{j} } \right\| = \sqrt {\sum\limits_{k = 1}^{D} {\left( {x_{i,k} - x_{j,k} } \right)^{2} } }$$
(3)

The movement of a firefly i that is attracted to a more appealing (brighter) firefly j is determined by

$$x_{i} = x_{i} + \beta_{0} e^{{ - \gamma r_{ij}^{2} }} \left( {x_{j} - x_{i} } \right) + \alpha \left( {rand - \frac{1}{2}} \right)$$
(4)

In Eq. (4), the second term is due to attraction and the third term is a randomization term, where rand is a random number generator uniformly distributed in [0, 1]. In most implementations, β0 = 1 and α ∈ [0, 1]. For more details of the firefly algorithm, see Yang (2009) and Gandomi et al. (2013).
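To make Eqs. (1)–(4) concrete, the following minimal Python sketch implements one movement step of the standard FA under the conventions above (minimization, brighter = lower objective value); the function and variable names are illustrative and not the authors' implementation.

```python
import numpy as np

def fa_step(X, objective, beta0=1.0, gamma=1.0, alpha=0.2, rng=None):
    """One iteration of the standard firefly movement (Eq. 4).

    X         : (m, D) array, one firefly position per row.
    objective : callable mapping a 1-D position to a scalar (lower is better).
    """
    if rng is None:
        rng = np.random.default_rng()
    m, D = X.shape
    light = np.array([objective(x) for x in X])  # brightness ~ objective value
    for i in range(m):
        for j in range(m):
            if light[j] < light[i]:              # firefly j is brighter
                r2 = np.sum((X[i] - X[j]) ** 2)          # squared distance (Eq. 3)
                beta = beta0 * np.exp(-gamma * r2)       # attractiveness (Eq. 2)
                X[i] = X[i] + beta * (X[j] - X[i]) + alpha * (rng.random(D) - 0.5)
    return X
```

This sketch recomputes brightness once per iteration, a common simplification; a full FA loop would re-evaluate positions and track the global best after each step.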

2.2 Opposition-Based Learning (OBL)

Opposition-based learning (OBL) was originally put forward by Tizhoosh (2006); it seeks the optimal solution of a given problem by evaluating a candidate solution and its opposite at the same time.

In general, meta-heuristic algorithms begin with a set of initial solutions (the initial population) and improve it toward the global optimal solution(s); the search ends when some predefined criteria are met. Without prior knowledge about the solution, initialization typically draws random samples over the whole range. In the worst case, when the best solution lies far from the random samples, the computation takes a long time, so the time complexity rises. If instead we simultaneously evaluate a candidate and its exact opposite, the fitter of the two can be chosen as the initial solution. Indeed, Tizhoosh's earlier research indicates that, fifty percent of the time, a random guess is farther from the ideal solution than its opposite. Consequently, it is better to begin with an initial population that keeps the better of each pair of guesses. In this research, a form of the OBL strategy is adopted, first to begin with a population of promising solutions, and second to diversify the search when the best firefly stagnates. The concepts of opposite number and opposition-based initialization are explained below:

Definition 1

Let \(x \in \left[ {m,n} \right]\) be a real number. Its opposite number \(\tilde{x}\) is defined by

$$\tilde{x} = m + n - x$$
(5)

Correspondingly, the opposite point in D-dimensional space is defined as follows.

Definition 2

Let \({\text{X}} = \left( {x_{1} ,x_{2} , \ldots ,x_{D} } \right)\) be a point in D-dimensional space, in which \(x_{1} ,x_{2} , \ldots ,x_{D} \in R\) and \(x_{i} \in \left[ {m_{i} ,n_{i} } \right]\), \(\forall \;{\text{i}} \in \left\{ {1,2, \ldots ,D} \right\}\). The opposite point \(\tilde{X} = \left( {\tilde{x}_{1} ,\tilde{x}_{2} , \ldots ,\tilde{x}_{D} } \right)\) is defined componentwise by

$$\tilde{x}_{i} = m_{i} + n_{i} - x_{i}$$
(6)

Definition 3

Assume that \(X = \left( {x_{1} ,x_{2} , \ldots ,x_{D} } \right)\), a point in D-dimensional space, is a candidate solution, and let f(·) be a fitness function used to assess the candidate's fitness. According to the definition of the opposite point, \(\tilde{X} = \left( {\tilde{x}_{1} ,\tilde{x}_{2} , \ldots ,\tilde{x}_{D} } \right)\) is the opposite of \(X\). If \(f\left( {\tilde{X}} \right) \ge f\left( X \right)\), the candidate solution X is replaced by \(\tilde{X}\); otherwise, the search continues with X. Thus a point and its opposite point are evaluated simultaneously, and the search continues with the fitter one.
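A minimal Python sketch of Definitions 1–3 is given below, assuming (as in Definition 3) that larger fitness values are fitter; the per-dimension bounds m and n are passed as arrays.

```python
import numpy as np

def opposite(X, m, n):
    """Opposite point of X (Eqs. 5 and 6): each component x_i -> m_i + n_i - x_i."""
    return m + n - X

def opposition_based_init(pop_size, dim, m, n, f, rng=None):
    """Generate a random population, evaluate each point and its opposite,
    and keep the fitter of each pair (Definition 3, larger f = fitter)."""
    if rng is None:
        rng = np.random.default_rng()
    P = m + (n - m) * rng.random((pop_size, dim))     # uniform in [m_i, n_i]
    P_opp = opposite(P, m, n)
    fit = np.apply_along_axis(f, 1, P)
    fit_opp = np.apply_along_axis(f, 1, P_opp)
    keep_opp = fit_opp >= fit                         # opposite is fitter
    P[keep_opp] = P_opp[keep_opp]
    return P
```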

3 Description of the Proposed Algorithm (MFA)

This section describes the proposed FS algorithm. The main goal is to establish a global search method that both handles the feature selection problem well and is easy to implement.

3.1 Encoding of Fireflies

Unlike existing studies that adopt a binary string, in this paper we use a probability-based strategy (Algorithm 2): each encoded element represents the probability of a feature being selected into the feature subset. These elements together form a firefly, which stands for a candidate solution to the problem. Taking a data set with D features as an instance, the ith firefly in the swarm is represented by a D-bit real-valued string as below:

$$X_{i} = \left( {x_{i,1} ,x_{i,2} ,x_{i,3} , \ldots ,x_{i,D} } \right), \;\;i = 1,2, \ldots ,m$$
(7)

In this equation, m is the swarm size, and \(x_{i,j} \in \left[ {0,1} \right]\) is the probability that the jth feature is selected into the next subset.

A firefly \(X_{i}\) can be decoded to a solution \(Z_{i}\), which is established as follows:

$$Z_{i,j} = \left\{ {\begin{array}{*{20}c} {1,} & {x_{ij} \ge rand} \\ {0,} & {otherwise} \\ \end{array} } \right.$$
(8)

where \(z_{i,j} = 1\) indicates that the jth feature is selected into the feature subset \(Z_{i}\). For example, the decoded firefly 10100010 over 8 features indicates that the 1st, 3rd and 7th features are chosen.
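A minimal sketch of the decoding in Eq. (8) follows; drawing a fresh uniform random threshold for each component is an assumption consistent with rand in Eq. (8).

```python
import numpy as np

def decode(x, rng=None):
    """Decode a probability-encoded firefly x in [0,1]^D into a binary mask Z (Eq. 8)."""
    if rng is None:
        rng = np.random.default_rng()
    return (x >= rng.random(x.shape)).astype(int)

# Example: a decoded firefly such as [1,0,1,0,0,0,1,0]
# selects the 1st, 3rd and 7th features.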

3.2 Fitness Function

The fitness function is applied to assess both classification performance and the number of selected features, where the weight for the number of features is extremely small. The fitness function is given by Eq. (9).

$$Fitness = ErrorRate + \alpha \times \# Features$$
(9)
$$ErrorRate = \frac{FP + FN}{TP + TN + FP + FN}$$
(10)

ErrorRate stands for the training classification error of the chosen features and is calculated from FP, FN, TP and TN, which represent false positives, false negatives, true positives, and true negatives, respectively. \(\alpha\) indicates the relative significance of the number of features. In this paper, \(\alpha\) is set to an extremely small value to ensure that the feature-count term is always smaller than ErrorRate. Classification performance therefore dominates Eq. (9), which favors feature subsets with a small classification error rate.
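A direct transcription of Eqs. (9) and (10) in Python is shown below; the value of alpha is illustrative, since the paper only states that it is extremely small.

```python
def error_rate(tp, tn, fp, fn):
    """Classification error rate (Eq. 10)."""
    return (fp + fn) / (tp + tn + fp + fn)

def fitness(err, n_selected, alpha=1e-4):
    """Fitness of a feature subset (Eq. 9): classification error plus a
    tiny penalty on the number of selected features (alpha is illustrative)."""
    return err + alpha * n_selected
```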

3.3 Proposed Method for Feature Selection

This part analyzes the proposed technique in detail. The aim is to apply the enhanced algorithm, MFA, to feature selection in classification. The proposed methodology is described below, and the algorithmic flow is given in Algorithm 4.

To increase the search capability of FA and decrease the probability of it being trapped in local optima, a new modified technique (MFA) is presented. The algorithm rests on two main ideas. One is opposition-based population initialization, which is used to improve population diversity. The other is the opposition strategy (Algorithm 3), which drives the whole firefly population toward the best potential local or global individual.
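A hedged sketch of the stagnation-escape part of the opposition strategy follows: if gbest has not improved for a number of iterations, the worst firefly is replaced by the opposite of gbest. The stall counter and its limit are assumptions, as the paper only says "a fixed amount of iterations".

```python
import numpy as np

def escape_stagnation(P, fit, gbest, m, n, stall, stall_limit=10):
    """If gbest has stagnated for stall_limit iterations, replace the worst
    firefly with the opposite position of gbest (illustrative sketch).

    P     : (pop, D) population; fit : fitness per firefly (lower is better).
    m, n  : per-dimension lower/upper bounds; stall : iterations without improvement.
    """
    if stall >= stall_limit:
        worst = np.argmax(fit)           # least fit firefly (minimization)
        P[worst] = m + n - gbest         # opposite of gbest (Eqs. 5 and 6)
        stall = 0                        # reset the stagnation counter
    return P, stall
```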

4 Experimental Results and Analysis

4.1 Datasets and Parameter Settings

All experiments are conducted on ten datasets (Table 1) selected from the UCI repository. These datasets cover various numbers of features, classes and examples and were chosen to evaluate the proposed algorithm. For each dataset, the examples are randomly split into two sets: seventy percent as the training set and thirty percent as the test set.

Table 1 Datasets

Feature selection is a binary problem, so a firefly is represented as an “n-bit” string, where “n” is the total number of features in the dataset.

For each dataset, thirty independent runs were performed to examine the feature selection performance of each algorithm. The parameters are defined as follows: \(\beta_{0} = 1\), \(\gamma = 0.2\), \(\alpha = 0.1\), the population size is 30, and the maximum number of iterations is 100.

As a wrapper method, the proposed algorithm needs a learning algorithm. KNN, acknowledged as a simple and frequently applied learning algorithm, is adopted in this research with K = 5 (5-NN).
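For illustration, the wrapper evaluation might look like the following sketch, assuming scikit-learn's KNeighborsClassifier with K = 5 and the 70/30 split described above; following Sect. 3.2, the fitness uses the training error, while the test split is reserved for reporting final accuracy.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def evaluate_subset(X, y, mask, alpha=1e-4, seed=0):
    """Wrapper fitness of a binary feature mask using 5-NN (Eq. 9)."""
    if mask.sum() == 0:
        return 1.0                                  # empty subset: worst fitness
    cols = mask.astype(bool)
    Xtr, Xte, ytr, yte = train_test_split(X[:, cols], y,
                                          test_size=0.3, random_state=seed)
    knn = KNeighborsClassifier(n_neighbors=5).fit(Xtr, ytr)
    train_err = 1.0 - knn.score(Xtr, ytr)           # ErrorRate (Eq. 10)
    return train_err + alpha * mask.sum()
```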

4.2 Comparison Results of MFA and Other Methods

Table 2 reports the accuracy of the proposed algorithm over 30 runs of FA, MFA and PSO. In Tables 2 and 3, the three values represent the mean, the best, and the standard deviation of the classification accuracy obtained from the thirty runs on each test set.

Table 2 Comparisons between FA, MFA and PSO
Table 3 Average numbers of feature selected from the different datasets

According to Table 2, MFA achieved the best classification performance of the three algorithms on the majority of the datasets. The classification performance of MFA was similar to that of FA on one dataset, better than FA on seven datasets, but worse than FA on three datasets, which indicates that MFA explores the feature space adaptively better than the other techniques. Table 2 also shows that MFA has the smallest standard deviation on five datasets compared with FA and PSO, which indicates that MFA outperforms the other algorithms in stability and in its ability to reach the optimum.

From Table 3, it is clear that the feature subsets chosen by MFA were larger than those of FA on all ten datasets but smaller than those of PSO on 7 of the 10 datasets. The main reason is that classification performance is treated as more significant than the number of features.

To further demonstrate the effectiveness of the MFA algorithm, three existing feature selection algorithms, namely ReliefF, sequential forward selection (SFS) and MIM, are applied to the same datasets (German, Ionosphere, Vehicle and Lung). Figure 1 shows that MFA provides higher classification accuracy rates than these existing feature selection methods.

Fig. 1 Classification accuracy rates using different feature selection algorithms on the datasets

5 Conclusion

In this work, an improved firefly algorithm is put forward for feature selection in wrapper mode. The continuous firefly algorithm (FA) is transformed into a binary form by discrete coding. The improved FA employs opposition-based learning in population initialization and an opposition strategy in the search process, which accelerates convergence toward the global optimum. The experimental results on these datasets indicate that the proposed algorithm, MFA, obtains better classification accuracy than the other methods.