1 Introduction

Machine learning provides computer-based approaches suited to the analysis of many types of datasets, and these approaches improve with experience (Ahmadi and Mahmoudi 2016; Ahmadi 2015b; Ahmadi and Bahadori 2016; Ahmadi et al. 2015d; Ahmadi and Shadizadeh 2012). Machine learning techniques address clustering, classification, prediction and many other problems through supervised, unsupervised and semi-supervised methods (Ahmadi et al. 2014b, c, d, e, f, g; Ahmadi and Ebadi 2014). One of the major applications of microarray data analysis is sample classification for disease diagnosis and prognosis. Machine learning techniques that have been used for cancer classification of microarray data include decision trees, neural networks, support vector machines and the Naïve Bayesian (NB) classifier. However, the small number of samples relative to the high dimensionality of the data is the main difficulty for most machine learning techniques; this problem is known as the 'curse of dimensionality.' Dimension reduction therefore plays an important role in DNA microarray data classification (Lazar et al. 2012; Saeys et al. 2007).

There are two broad approaches to dimension reduction: feature extraction and feature selection. Feature extraction transforms the data into a lower-dimensional space using combinations of the original features, whereas feature selection chooses the most relevant of the original features to construct the classification model. Feature selection algorithms fall into three types: filter, wrapper and embedded methods. Filter methods select features in a preprocessing step, without considering classification accuracy. Wrapper methods run a repetitive search in which the results of the learning algorithm in each iteration guide the search (Kohavi and John 1997); because they invoke the learning algorithm throughout the search, their computational cost is high, especially for high-dimensional datasets. Embedded methods (Ahmadi and Golshadi 2012; Ahmadi 2011; Ahmadi et al. 2015h) differ from filter and wrapper approaches in that the search mechanism is built into the classifier model itself. Recently, hybrid search techniques have been used to combine the advantages of the extraction/filter and wrapper approaches: a first subset of features is selected or extracted by the filter/extraction method, and the wrapper method then selects the final feature set. The computational cost of the wrapper step thus becomes acceptable because it operates on a reduced feature set. Information gain with a memetic algorithm (Zibakhsh and Abadeh 2013), Fisher score with a GA and PSO (Zhao et al. 2011), mRMR with the ABC algorithm (Alshamlan et al. 2015a) and independent component analysis with a fuzzy algorithm (Aziz et al. 2016) are recently proposed hybrid methods for reducing the dimensionality of microarray data.

ICA is a multivariate statistical method for uncovering hidden information underlying a set of random variables (Hyvarinen et al. 2001). The ICA technique has received growing attention as an effective dimension reduction algorithm for NB classification of high-dimensional data (Kong et al. 2008). The reason is that the conditional independence assumption at the heart of the NB classifier is well satisfied when the components extracted by ICA are statistically independent. An open problem remains, however: how to choose the subset of independent components (ICs) that improves the performance of the base classifier. To solve this problem, different authors have applied different wrapper methods for choosing the best subset of ICs. For example, the sequential floating forward selection (SFFS) method was used to choose the ICA feature vector for SVM classification of microarray data (Zheng et al. 2006). Zheng et al. (2008) classified gene expression data using consensus independent component analysis (ICA) as a dimension reduction technique. A sequential feature extraction method was used to choose the best gene set from the independent component vectors for the NB classifier (Fan et al. 2009). Other authors used various filter methods to rank the ICA feature vectors and increase the classification accuracy of SVM and NB classifiers (Rabia et al. 2015a, b). On the other hand, wrapper methods based on bio-inspired evolutionary techniques, such as ant colony optimization (ACO) (Tabakhi et al. 2014), the genetic algorithm (GA) (Huang and Wang 2006), particle swarm optimization (PSO) (Lin et al. 2008), the bacterial foraging algorithm (BFA) and fish school search (FSS), are more relevant and provide more exact solutions than filter-based wrapper techniques because they are able to search for optimum or near-optimum solutions in high-dimensional solution spaces (Arqub and Abo-Hammour 2014; Ahmadi 2016; Ahmadi et al. 2015c, e, f, g; Ahmadi and Bahadori 2015; Baghban et al. 2015). These bio-inspired algorithms have been applied effectively to dimension reduction in various domains, such as finance, face recognition and text classification (Abo-Hammour et al. 2014). Their results, however, depend on the complexity of the search space, the fitness function, the algorithm parameters, convergence behavior, etc. (Ali Ahmadi and Ahmadi 2016; Ahmadi 2015a; Ahmadi et al. 2014a, 2015a, b; Shafiei et al. 2014). All of these classifier-based methods have achieved satisfactory dimension reduction performance in different fields, but they have not been used frequently for feature selection on DNA microarray data because of their high computational cost. The computational cost of a hybrid technique, however, is lower than that of a pure wrapper technique because its second step operates on a reduced number of features.

Since the components extracted by ICA match the conditional independence assumption of the NB classifier, an ABC-based wrapper approach with the NB classifier is applied to optimize the ICA feature vectors, seeking the smallest number of features that improves the classification accuracy of NB. In this paper, we focus on the impact of the proposed ICA + ABC algorithm with the NB classifier and on how this combination improves the performance of NB. The proposed ICA + ABC hybrid algorithm is an iterative computational process in which a population of agents chooses a different subset of features in each iteration; the performance of each subset is then estimated by the classification accuracy of NB (Fig. 1).

Fig. 1 Schematic representation of the proposed methodology

2 Proposed approach

2.1 Feature extraction by ICA

Independent component analysis is a feature extraction technique, proposed by Hyvarinen, for handling non-Gaussian processes; it has been applied successfully in many fields (Hyvarinen et al. 2001). The extraction process of ICA is similar in spirit to principal component analysis (PCA): PCA maps the data into another space using principal components, whereas ICA finds a linear representation of non-Gaussian data such that the extracted components are statistically independent. The theory of the ICA algorithm can be found elsewhere (Aziz et al. 2016; Hsu et al. 2010).
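To make the extraction step concrete, the following minimal sketch reduces a microarray-like matrix with scikit-learn's FastICA. This is an illustration only: the paper's implementation uses the MATLAB FastICA package (Sect. 4), and the matrix shape and component count below are made-up assumptions.

```python
# Minimal ICA feature extraction sketch (illustrative; the paper uses the
# MATLAB FastICA package). Shapes and component count are assumptions.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
X = rng.standard_normal((62, 2000))      # e.g., 62 samples x 2000 genes

ica = FastICA(n_components=50, random_state=0, max_iter=1000)
S = ica.fit_transform(X)                 # 62 x 50 independent-component features

print(S.shape)                           # reduced matrix handed to the ABC stage
```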

2.2 Feature selection by ABC

Artificial bee colony (ABC), introduced by Dervis Karaboga in 2005 (Karaboga 2005), is an evolutionary optimization algorithm that can be used to select the best feature subset; it mimics the intelligent foraging behavior of honey bee swarms. Garro et al. (2016) applied ABC to select the best gene sets from microarray data, using a distance-based classifier and an artificial neural network (ANN). The ABC algorithm combines local and global search, carried out by three classes of bees: employed, onlooker and scout bees (Garro et al. 2016). Through their different roles in the search space (colony), these three classes of bees drive the search toward a near-optimal solution. A hedged code sketch of all three phases is given after Eq. (3).

Employed bees: These bees search for new food sources in the neighborhood of their current ones. Each new food source is compared with the old one using Eq. (1):

$$ v_{i}^{j} = x_{i}^{j} + \phi_{i}^{j} \left( x_{i}^{j} - x_{k}^{j} \right) $$
(1)

where \( v_{i}^{j} \) is the newly generated solution, \( k \ne i \) indexes a randomly chosen neighboring solution, and \( \phi_{i}^{j} \) is a random number in [−1, 1]. If the fitness value of \( v_{i}^{j} \) is better than that of \( x_{i}^{j} \), then \( x_{i}^{j} \) is replaced by \( v_{i}^{j} \); otherwise, \( x_{i}^{j} \) is kept unchanged.

Onlooker bees: Employed bees share the information about their solutions with the onlooker bees. Using this information, onlooker bees choose food sources with probabilities proportional to their nectar amounts (fitness). The probability of choosing food source i is calculated using Eq. (2):

$$ p_{i} = \frac{{\text{fit}}_{i}}{\sum\nolimits_{k = 1}^{SN} {\text{fit}}_{k}} $$
(2)

where \( {\text{fit}}_{i} \) is the fitness value of the ith solution and SN is the number of food sources.

Scout bees: If the fitness value of the ith solution (\( {\text{fit}}_{i} \)) cannot be improved for a predefined number of cycles, called the 'limit,' the solution is considered abandoned (the 'abandonment criterion'). In that case, a scout bee creates a new random solution to replace the ith solution using Eq. (3):

$$ x_{i}^{j} = x_{\min}^{j} + \text{rand}(0,1)\left( x_{\max}^{j} - x_{\min}^{j} \right). $$
(3)
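As referenced above, the sketch below puts the three bee phases together for feature-subset selection with an NB fitness. It is a hedged illustration under explicit assumptions the paper does not spell out: each food source is a continuous weight vector over the (ICA-extracted) features, thresholded at 0.5 to obtain a subset, and 5-fold cross-validation accuracy stands in for the slower LOOCV fitness of Sect. 4.

```python
# Hedged ABC feature-selection sketch implementing Eqs. (1)-(3). Encoding
# assumptions (weight vectors thresholded at 0.5, 5-fold CV fitness) are
# illustrative; the paper does not specify its exact encoding.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def subset_fitness(weights, X, y):
    """NB cross-validation accuracy on the features whose weight exceeds 0.5."""
    mask = weights > 0.5
    if not mask.any():
        return 0.0
    return cross_val_score(GaussianNB(), X[:, mask], y, cv=5).mean()

def neighbour_move(foods, fits, trials, i, X, y):
    """Eq. (1): perturb one dimension of source i relative to a random k != i."""
    n, dim = foods.shape
    k = rng.choice([m for m in range(n) if m != i])
    j = rng.integers(dim)
    v = foods[i].copy()
    v[j] = np.clip(v[j] + rng.uniform(-1.0, 1.0) * (v[j] - foods[k, j]), 0.0, 1.0)
    fv = subset_fitness(v, X, y)
    if fv > fits[i]:                          # greedy replacement
        foods[i], fits[i], trials[i] = v, fv, 0
    else:
        trials[i] += 1

def abc_select(X, y, n_sources=10, max_cycles=20, limit=5):
    dim = X.shape[1]
    foods = rng.uniform(0.0, 1.0, (n_sources, dim))   # random init, cf. Eq. (3)
    fits = np.array([subset_fitness(f, X, y) for f in foods])
    trials = np.zeros(n_sources, dtype=int)
    for _ in range(max_cycles):
        for i in range(n_sources):                    # employed bee phase
            neighbour_move(foods, fits, trials, i, X, y)
        total = fits.sum()                            # Eq. (2): selection probs
        p = fits / total if total > 0 else np.full(n_sources, 1.0 / n_sources)
        for _ in range(n_sources):                    # onlooker bee phase
            neighbour_move(foods, fits, trials, int(rng.choice(n_sources, p=p)), X, y)
        worst = int(trials.argmax())                  # scout bee phase, Eq. (3)
        if trials[worst] > limit:
            foods[worst] = rng.uniform(0.0, 1.0, dim)
            fits[worst] = subset_fitness(foods[worst], X, y)
            trials[worst] = 0
    best = int(fits.argmax())
    return foods[best] > 0.5, fits[best]

# Toy run on synthetic data standing in for ICA components.
X = rng.standard_normal((60, 40))
y = rng.integers(0, 2, 60)
mask, acc = abc_select(X, y)
print(mask.sum(), "features selected, CV accuracy =", round(acc, 3))
```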

3 Classifier (NB)

Naïve Bayes is a simple supervised learning algorithm for classification. The NB classifier applies Bayes' rule with a strong (conditional) independence assumption to mine different types of data (Friedman et al. 1997; Hall 2007). Owing to its simplicity, Naïve Bayes is a popular classifier among researchers for many classification problems, including microarray data (Chen et al. 2009; Sandberg et al. 2001). Its performance is often robust and efficient compared to other supervised machine learning classification algorithms. More detailed information on NB can be found elsewhere (Aziz et al. 2016; Fan et al. 2009).
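For illustration, the following minimal sketch shows the Gaussian NB variant used in Sect. 4: per-class, per-feature Gaussians combined by Bayes' rule under the conditional independence assumption (scikit-learn stands in here for the paper's MATLAB implementation; the data are synthetic).

```python
# Minimal Gaussian Naive Bayes illustration: class-conditional Gaussians per
# feature, combined by Bayes' rule. Synthetic data; scikit-learn stands in
# for the paper's MATLAB implementation.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (30, 5)),    # class 0 samples
               rng.normal(1.0, 1.0, (30, 5))])   # class 1 samples
y = np.array([0] * 30 + [1] * 30)

nb = GaussianNB().fit(X, y)
print(nb.predict(X[:3]))          # predicted class labels
print(nb.predict_proba(X[:3]))    # posterior P(class | features)
```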

4 Experimental setup

To evaluate the performance of ICA + ABC, six benchmark microarray classification datasets are used: five binary classification datasets, namely colon cancer (Alon et al. 1999), acute leukemia (Golub et al. 1999), prostate cancer (Singh et al. 2002), lung cancer II (Gordon et al. 2002) and high-grade glioma (Nutt et al. 2003), and one multi-class dataset, leukemia 2 (Armstrong et al. 2002). Table 1 gives a detailed description (number of classes, number of features, etc.) of these datasets.

Table 1 Summary of six high-dimensional biomedical microarray datasets (Kent ridge online repository)

In this study, an NB classifier with Gaussian distribution estimation is used for the microarray data (Rabia et al. 2015b). The goodness of each training subset is estimated by the leave-one-out cross-validation (LOOCV) classification accuracy of the NB classifier, which serves as the fitness value for ABC. The performance of each feature selection algorithm is judged by two criteria: the classification accuracy of NB and the number of selected features used for classification (the smaller, the better). The classification accuracy of NB is the overall correctness of the classifier, calculated by the formula shown below:

$$ \text{Classification Accuracy} = \frac{\text{CC}}{N} \times 100 $$
(4)

where N is the total number of samples in the original microarray dataset and CC is the number of correctly classified samples. To validate the experimental results statistically, each gene selection algorithm was run 30 times and the resulting fitness values were recorded. The ABC parameters were chosen on the basis of several related studies on ABC parameter selection (Akay and Karaboga 2009; Abu-Mouti and El-Hawary 2012; Alshamlan et al. 2015b; Garro et al. 2016). The ABC parameters used in our experiments are listed below (a hedged sketch of the LOOCV fitness computation follows the list):

  • Bee colony size = 100

  • Maximum number of cycles (generations) = 100

  • Number of runs = 30

  • Limit = 5 iterations
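For concreteness, a hedged sketch of the LOOCV accuracy of Eq. (4), which serves as the ABC fitness, is given below (synthetic data; scikit-learn's LeaveOneOut is used for illustration, whereas the paper's code is in MATLAB).

```python
# LOOCV classification accuracy of NB, the ABC fitness of Eq. (4):
# accuracy = CC / N * 100, where CC is the count of correctly classified
# samples and N the total sample count. Data below are synthetic.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(2)
X = rng.standard_normal((40, 10))    # 40 samples, 10 selected features
y = rng.integers(0, 2, 40)

scores = cross_val_score(GaussianNB(), X, y, cv=LeaveOneOut())
accuracy = 100.0 * scores.mean()     # mean of 0/1 fold scores equals CC / N
print(f"LOOCV accuracy: {accuracy:.2f}%")
```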

ICA was implemented with the FastICA algorithm in MATLAB (R2014a); the FastICA package is freely available online (http://research.ics.aalto.fi/ica/fastica/code/dlcode.shtml). Code for ABC feature selection is also freely available online (http://mf.erciyes.edu.tr/abc/).

5 Experimental results and discussions

Tables 2, 3, 4, 5, 6 and 7 report the LOOCV test-set classification accuracy rates of the NB classifier on the six microarray cancer datasets above, with features selected by the ICA + ABC and mRMR + ABC algorithms. For a fair comparison, the same ABC parameters are used for both ICA + ABC and mRMR + ABC. The optimal results for all datasets (highest accuracy with the minimum selected gene set size) are highlighted in bold. The ROC curves at different threshold values (Song et al. 2014) for the six best selected gene subsets obtained by the ICA + ABC and mRMR + ABC methods on the binary datasets are shown in Figs. 2a, b, 3a, b, 4a, b, 5a, b and 6a, b. The ROC curves for the multi-class dataset, with the best gene subset, are shown in Fig. 7a, b.

Table 2 Comparison between ICA + ABC and mRMR + ABC algorithms' classification performance when applied with the NB classifier for the colon dataset
Table 3 Comparison between ICA + ABC and mRMR + ABC algorithms' classification performance when applied with the NB classifier for the acute leukemia dataset
Table 4 Comparison between ICA + ABC and mRMR + ABC algorithms' classification performance when applied with the NB classifier for the prostate tumor dataset
Table 5 Comparison between ICA + ABC and mRMR + ABC algorithms' classification performance when applied with the NB classifier for the high-grade glioma dataset
Table 6 Comparison between ICA + ABC and mRMR + ABC algorithms' classification performance when applied with the NB classifier for the lung cancer II dataset
Table 7 Comparison between ICA + ABC and mRMR + ABC algorithms' classification performance when applied with the NB classifier for the leukemia 2 dataset
Fig. 2 ROC curve with six best selected subsets of genes with ICA + ABC and mRMR + ABC algorithms for NB classifier of colon dataset

Fig. 3 ROC curve with six best selected subsets of genes with ICA + ABC and mRMR + ABC algorithms for NB classifier of acute leukemia dataset

Fig. 4 ROC curve with six best selected subsets of genes with ICA + ABC and mRMR + ABC algorithms for NB classifier of prostate cancer dataset

Fig. 5 ROC curve with six best selected subsets of genes with ICA + ABC and mRMR + ABC algorithms for NB classifier of high-grade glioma dataset

Fig. 6 ROC curve with six best selected subsets of genes with ICA + ABC and mRMR + ABC algorithms for NB classifier of lung cancer II dataset

Fig. 7 ROC curve with the best selected subset of genes with ICA + ABC and mRMR + ABC algorithms for NB classifier of leukemia 2 dataset

The following observations can be made from Tables 2, 3, 4, 5, 6 and 7 and Figs. 2a, b, 3a, b, 4a, b, 5a, b, 6a, b and 7a, b:

  1. As can be seen from Tables 2, 3, 4, 5, 6 and 7, the proposed ICA + ABC algorithm was able to find the best gene subsets for the classification problems of the different datasets. To obtain these subsets, ICA first extracts, on average, 50 to 180 genes from the 2000–12,500 genes of each training dataset. ABC then selects different subsets of genes from the ICA feature vectors of each dataset, evaluates the classification accuracy of NB on each selected subset, retains the smallest gene set that gives the best NB accuracy and finally computes the classification accuracy on the test data. Through this process, the original thousands of genes in the different datasets were reduced to an average of 15.4 genes at the highest classification accuracy.

  2. The benefit of the small number of genes selected by ABC from the ICA feature vectors is clearly noticeable in the classification accuracy. For the colon cancer data, the mean classification accuracy with all ICA features was 70.71%, while with the ABC wrapper approach it increased to 91.12% with 16 genes. Similarly, ABC achieved an average classification accuracy of 96.55% for acute leukemia with 12 genes, whereas the average accuracy with all 72 ICA feature vectors had been below 66.82%. The same pattern holds for all remaining datasets with the ICA + ABC algorithm. ABC is therefore a powerful optimization algorithm for finding the best gene subsets from ICA features for NB classification of microarray data.

  3. The AUC (area under the ROC curve) values of the NB classifier vary with the size of the gene subset, and the subset giving the greatest AUC differs across datasets. The best AUC for the colon data is 95.05 with 16 selected genes at a threshold of 0.3, and 12 genes give the highest AUC of 97.11 for the acute leukemia data at a threshold of 0.7. The highest AUC values for the prostate and high-grade glioma data, at a threshold of 0.5, are 94.21 and 93.65 with 16 and nine genes, respectively. For the lung cancer II data, the best AUC is 92.87 with 24 selected genes at a threshold of 0.2. We also plot the ROC for the leukemia 2 data for the best gene subsets obtained by ICA + ABC and mRMR + ABC: the black curve depicts the ROC when class one is separated from classes two and three, the red curve when class two is separated from classes three and one, and the blue curve when class three is separated from classes one and two. The average AUC is 95.97 with the proposed approach and 94.79 for the mRMR + ABC algorithm. Figures 2a, b, 3a, b, 4a, b, 5a, b, 6a, b and 7a, b also show that ICA + ABC produces the best AUC scores with the smallest number of genes compared to the mRMR + ABC algorithm. (A minimal ROC/AUC sketch is given after this list.)

  4. Furthermore, the results produced by ICA combined with conventional feature selection methods, such as the signal-to-noise ratio (SNR) and a fuzzy algorithm, and with similar bio-inspired algorithms, such as PSO and GA, are shown in Table 8. The results show that the proposed ICA + ABC method obtained the best classification accuracy among the gene selection methods on three binary classification datasets and one multi-class dataset. For the other two cancer datasets, ICA + GA obtained the highest accuracy on the high-grade glioma data, while ICA + fuzzy obtained the highest accuracy on the lung cancer II data. Notably, although ICA + GA and ICA + fuzzy obtained the best classification accuracy on those two datasets, ICA + ABC selected the smallest number of genes for the best classification accuracy on all six microarray cancer datasets.

    Table 8 Classification performance of the proposed algorithm compared with some conventional and bio-inspired algorithms when combined with ICA for NB classifier on six microarray datasets
  5. SVM is also a popular supervised machine learning algorithm for data classification, and various studies have shown that SVM outperforms other classifiers for microarray data. We therefore also implemented the proposed approach with an SVM classifier and compared the results of the SVM and NB classifiers on the six datasets. As Table 9 shows, the NB classifier gives slightly better results than the SVM classifier with the proposed approach, except for the lung cancer II data, where SVM gives better classification accuracy than NB.

    Table 9 Classification performance of SVM and NB classifiers with the proposed algorithm on six microarray datasets
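As referenced in observation 3 above, ROC curves and AUC values of this kind can be computed as in the following hedged sketch (labels and scores are synthetic placeholders, not the paper's data):

```python
# Hedged ROC/AUC sketch for a binary NB classifier (cf. observation 3).
# Synthetic data; roc_curve sweeps over all decision thresholds.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 1.0, (30, 8)),
               rng.normal(0.8, 1.0, (30, 8))])
y = np.array([0] * 30 + [1] * 30)

scores = GaussianNB().fit(X, y).predict_proba(X)[:, 1]  # P(class 1 | x)
fpr, tpr, thresholds = roc_curve(y, scores)             # one point per threshold
print("AUC:", round(auc(fpr, tpr), 4))
```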

To compare the performance of the ICA + ABC and mRMR + ABC algorithms, a statistical hypothesis test was employed to determine, with a certain level of confidence, whether a significant difference exists between them. A parametric paired t test was applied with α = 0.05 to check whether the average difference in their performance over the problems is significantly different from zero (Derrac et al. 2011). A paired t test carries out a pairwise comparison of the performance of two algorithms. The null hypothesis is that the results of the two algorithms come from the same distribution; it is rejected when the p value reported by the test is smaller than the significance level (α). Equivalently, if the calculated t value is greater than the tabulated value of the t distribution with (n − 1) degrees of freedom, the null hypothesis is rejected, which means that one algorithm outperforms the other with the associated p value (Zar 1999).
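As an illustration of the test just described, the following sketch runs a paired t test with scipy; the six per-dataset accuracy values are hypothetical placeholders, not the reported results.

```python
# Hedged paired t test sketch comparing two algorithms across six datasets.
# Accuracy values are hypothetical placeholders, NOT the paper's results.
import numpy as np
from scipy import stats

acc_ica_abc = np.array([91.1, 96.5, 94.0, 93.6, 92.9, 95.9])   # hypothetical
acc_mrmr_abc = np.array([89.0, 94.8, 92.1, 91.7, 92.0, 94.7])  # hypothetical

t_stat, p_value = stats.ttest_rel(acc_ica_abc, acc_mrmr_abc)
h = int(p_value < 0.05)      # h = 1: reject the null hypothesis at alpha = 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, h = {h}")
```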

From Table 10, it is clear that the proposed ICA + ABC approach shows an improvement not only over mRMR + ABC but also over the other four methods, i.e., the ICA + fuzzy, ICA + SNR, ICA + PSO and ICA + GA algorithms, at a significance level of α = 0.05. Here, h = 1 indicates that the null hypothesis is rejected, while h = 0 indicates that it is accepted. The statistically significant difference between the ICA + ABC algorithm and the other five algorithms is evident from the rejection of the null hypothesis, as the reported p value is less than 0.05 in all cases, as depicted in Fig. 8. MATLAB (R2014a) was used to compute the p values and t values for the t tests.

Table 10 T test results of ICA + ABC over mRMR + ABC, ICA + fuzzy, ICA + SNR, ICA + PSO and ICA + GA algorithms with a level of significance α = 0.05
Fig. 8 T test comparing results of the proposed algorithm over other algorithms. The red line marks the significance level α = 0.05 (color figure online)

In summary, the ABC wrapper approach applied to the ICA feature vectors improves the accuracy of the NB classifier by discarding irrelevant genes, while mRMR + ABC also improves classification accuracy but requires a somewhat larger gene subset than ICA + ABC on all six microarray datasets. There are two main reasons why ICA + ABC performed better than the mRMR + ABC feature selection algorithm. First, ICA accurately extracts independent components in the first stage, which satisfies the conditional independence assumption of the NB classifier. Second, ICA + ABC integrates a hybrid search method that combines an extraction approach with a random-search wrapper approach; to our knowledge, such a hybrid search method has not previously been used for NB classification. The benefit of adopting this kind of hybrid search technique can clearly be seen in Fig. 9. ICA + ABC therefore has a significant ability to improve the classification accuracy of an NB classifier on different microarray datasets while using a smaller number of genes.

Fig. 9 Average error rate of NB classifier for the six datasets with different gene selection methods when combined with ICA

6 Conclusion

To improve the classification accuracy of the NB classifier, an improved gene selection method based on ABC and ICA has been proposed in this paper. Comparative results show that, with the smaller gene subsets selected by the proposed method, the NB classifier achieves higher classification accuracy on six benchmark microarray cancer datasets than previously proposed methods, demonstrating the efficiency and effectiveness of the proposed gene selection method. This work therefore recommends the two-stage ICA + ABC gene selection method for selecting the best gene subsets to achieve higher NB classification accuracy on microarray data.