
1 Introduction

Microarray data is now used in many areas of medical diagnosis, such as the detection of breast cancer, lymphoma, and leukemia. Microarray experiments measure changes in the expression levels of a huge number of genes simultaneously. Classification is a supervised learning process that predicts a class label for unseen data on the basis of a training set whose class labels are already known. Many classifiers exist, such as SVM, k-nearest neighbor, ANN, the Bayesian classifier, decision trees, and linear regression. Commonly, a single classification method is not sufficient to identify the class label correctly. An ensemble technique is a supervised learning technique that combines a group of models in order to obtain an overall model with more precise decisions [1]. The prediction and classification performance of the models is usually improved by using ensemble techniques.

Hence, instead of choosing just one model, combining the outputs of different models reduces the risk of selecting a badly performing classifier. Several ensemble methods exist, such as voting, bagging, boosting, Bayesian merging, stacking, distribution summation, Dempster–Shafer, and density-based weighting [2, 3]. This work covers various classification and ensemble strategies, the rules for selecting reduced data from large data sets, the performance of different classification techniques, and how classification and ensemble techniques can be applied to different gene expression data sets. Here, stacking is used as an ensemble technique; that is, the decisions of the individual classifiers are combined using a majority voting fusion rule. Stacking combines multiple classifiers obtained by applying various learning algorithms to a particular data set [4].

Finally, a comparison is made among the different base classifiers and ensemble methods; the ensemble methods demonstrated substantially better performance.

The rest of the paper is organized as follows: the basic definition of a classifier ensemble is given in Sect. 2. Section 3 depicts the proposed model. Section 4 explains the general methods, concepts, and approaches used to obtain the results. Section 5 describes the two ensemble techniques used to improve the results. The results of the proposed model, obtained through simulation on a variety of datasets, are reported in Sect. 6.

2 Classifier Ensemble Analysis

Classification is the prediction of a certain result based on a given input. A training set containing a set of attributes and the outcome, usually called the goal or prediction attribute, is processed in order to learn to predict the result. In other words, classification is a data mining function that assigns items in a collection to target categories or classes. Generally, classification is the process of estimating to which of a set of categories a new example belongs, on the basis of a training dataset whose class labels are already known [2]. The algorithm that implements this process is known as a classifier, which is a mathematical function that maps data to a category.

In general, a single classification technique is not sufficient to identify the class label properly. An ensemble is itself a supervised learning algorithm that combines a set of models in order to obtain a global model with more accurate and reliable decisions [1]. Although using a larger number of algorithms in a model makes it more expensive, researchers are now emphasizing ensemble techniques because they tend to reduce the error rate in classification tasks compared with single classifiers. Also, amalgamating various techniques to reach a final conclusion makes the system more robust against the difficulties that each individual classifier may have on each data set. Ensembling is done mainly to improve the accuracy and efficiency of the classification system.

3 Proposed Model

As mentioned earlier, this work focuses on the second phase of the model, that is, the classifier ensemble techniques. In phase one, a random selection method is used to split the data into training and testing sets. We use three classifiers, namely naive Bayes, backpropagation neural network, and support vector machine. In the second phase, k-fold cross-validation is used to divide the data set into training and testing partitions; the value of k depends on the data set. Training and testing are then carried out k times for all the classifiers iteratively, and classifier fusion techniques, namely stacking and majority voting, are used to combine the outputs of the individual classifiers (Fig. 1).

Fig. 1 Proposed model

4 Concepts, Methods, and Approaches

Initially, the datasets need to be normalized. Data transformation such as normalization is a data preprocessing tool used in data mining systems to remove noisy data. An attribute of a dataset is normalized by scaling its values so that they fall within a specified range, such as 0.0–1.0 [5, 6]. Normalization is particularly useful for classification algorithms and clustering techniques. Here, min–max normalization is used for preprocessing, where \( \min_A \) and \( \max_A \) are the minimum and maximum values of an attribute A. The normalized value is computed as

$$ V = \frac{v - \min_A}{\max_A - \min_A}. $$
(1)
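
As an illustration of Eq. (1), a minimal sketch in Python with NumPy (the experiments in this paper were run in MATLAB; this snippet is only an illustrative stand-in):

```python
import numpy as np

def min_max_normalize(X):
    """Scale each column (attribute A) of X into [0.0, 1.0] via Eq. (1)."""
    col_min = X.min(axis=0)  # min_A for each attribute
    col_max = X.max(axis=0)  # max_A for each attribute
    span = np.where(col_max > col_min, col_max - col_min, 1.0)  # guard against constant columns
    return (X - col_min) / span

X = np.array([[2.0, 10.0], [4.0, 30.0], [6.0, 20.0]])
print(min_max_normalize(X))  # every column now lies in [0, 1]
```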

After normalization, data reduction is performed using PCA. Large data sets can cause serious problems in an organization's decision support and database management systems. Microarray data is high-dimensional and can give rise to significant problems such as irrelevant genes, difficulty in constructing classifiers, and multiple missing gene expression values. In this paper, we employ principal component analysis (PCA) as the feature reduction technique to extract the useful features, which are then used to train the classifiers. This feature-reduced dataset is expected to yield better classifiers in terms of accuracy and efficiency. PCA is a feature extraction method that applies an orthogonal linear transformation to move the data to a new coordinate system, such that the projection of the data with the greatest variance lies on the first coordinate (the first principal component), the second greatest variance on the second coordinate, and so on [7].
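
A minimal PCA sketch using scikit-learn is given below; the random matrix stands in for a microarray expression matrix, and the 95% variance threshold is an illustrative assumption rather than a setting taken from the paper:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 500))  # stand-in for a high-dimensional microarray matrix

# Keep enough leading principal components to explain 95% of the variance
# (the threshold is an illustrative choice, not taken from the paper).
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)
```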

After feature reduction, the reduced data set is used to train the various classifiers: backpropagation neural network, support vector machine, and naive Bayes.

Backpropagation is a learning (training) algorithm rather than a network itself; a backpropagation network learns by example. BPNN is a neural network learning algorithm that performs learning on a multilayered feed-forward neural network. Training is carried out by providing inputs to the network and changing the network's weights so that it produces the required output for a particular input. To train the network, we supply examples of the desired output (known as the target) for each input. The weights are modified for each training example so as to reduce the error between the network's prediction and the actual target value. Since the modifications are made in the backward direction, that is, from the output layer toward the hidden layers, the algorithm is called backpropagation [8].
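
As a rough analogue of the BPNN described above, a sketch using scikit-learn's MLPClassifier, which trains a feed-forward network by backpropagation (the hidden-layer size, iteration count, and dataset are illustrative assumptions, not the paper's settings):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# One hidden layer of 10 units; the weights are adjusted backward from the
# output layer to reduce the error between prediction and target.
bpnn = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
bpnn.fit(X_tr, y_tr)
print("BPNN accuracy:", bpnn.score(X_te, y_te))
```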

A naive Bayes classifier is a probabilistic classifier based on applying Bayes' theorem with strong independence assumptions. In plain terms, a naive Bayes classifier assumes that the value of an individual feature is unrelated to the presence or absence of any other feature, given the class variable. An advantage of the naive Bayes classifier is that it requires only a small amount of training data to estimate the parameters, namely the means and variances of the variables, necessary for classification [9]. Bayesian classification assumes an underlying probabilistic model, and it allows uncertainty about the model to be captured in a principled way by determining the probabilities of the outcomes. It calculates explicit probabilities for hypotheses and is robust to noise in the input data. Bayes' theorem provides a way to update the probability distribution of a variable in light of newly available information by calculating the conditional distribution of the variable given the new information. The updated conditional probability distribution reflects the new level of certainty about the variable. The posterior probability is calculated by updating the prior probability using Bayes' theorem; knowledge of prior events is thus used to predict future events [10, 11]. Bayes' theorem states:

$$ P(\theta \mid Y) = \frac{P(\theta)\,P(Y \mid \theta)}{P(Y)}, $$
(2)

where \( P(\theta) \) and \( P(Y) \) are the unconditional distributions of \( \theta \) and \( Y \), \( P(\theta \mid Y) \) is the posterior distribution of \( \theta \), and \( P(Y \mid \theta) \) is the likelihood function, which measures how closely \( Y \) is distributed around \( \theta \).
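
A sketch of Gaussian naive Bayes with scikit-learn, which estimates the per-class means and variances mentioned above and applies Eq. (2) under the independence assumption (the built-in breast cancer dataset here is only an illustrative stand-in for the benchmark data):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# GaussianNB fits a per-class mean and variance for every feature and combines
# them through Bayes' theorem (Eq. 2), assuming feature independence.
nb = GaussianNB().fit(X_tr, y_tr)
print("Posterior P(class | x) for one sample:", nb.predict_proba(X_te[:1]))
print("NB accuracy:", nb.score(X_te, y_te))
```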

An SVM uses a mapping function that transforms data in the input space into a feature space in which the data are linearly separable [12, 13]. In machine learning, support vector machines (SVMs) are supervised learning models with associated learning algorithms that analyze data and recognize patterns for classification [14]. A support vector machine represents the examples as points in space, mapped so that the categories are separated by a gap that is as wide as possible; new examples are then mapped into the same space and assigned to a category based on which side of the gap they fall. A nonlinear mapping is used to transform the original training data into a higher dimension. The SVM finds the hyperplane that separates the classes, identified through the support vectors: the separating hyperplane for class labels +1 and −1 lies at maximum distance from both the positive and the negative samples. Feature vectors extracted from the positive and negative pairs are assigned the class labels +1 and −1 to indicate whether a pair is interacting or non-interacting.
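
A minimal SVM sketch with scikit-learn's SVC; the RBF kernel supplies the nonlinear mapping to a higher-dimensional feature space described above (the kernel choice and parameters are illustrative assumptions, not the paper's settings):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# The RBF kernel implicitly maps inputs into a higher-dimensional space; the
# maximum-margin hyperplane there is defined by the support vectors.
svm = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)
print("Support vectors per class:", svm.n_support_)
print("SVM accuracy:", svm.score(X_te, y_te))
```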

5 Classifier Ensemble Methods

An ensemble is itself a supervised learning algorithm that combines a set of models in order to obtain a global model with more accurate and reliable decisions [2, 15]. Classifier combination has been one of the most frequently explored methods in data mining in recent years, because such techniques tend to reduce the error rate in classification tasks compared with single classifiers; hence, researchers are now emphasizing ensemble techniques. In this paper, majority voting and stacking are applied to various gene expression datasets.

In majority voting, an unlabeled example is classified in accordance with the class that obtains the highest number of votes. It can be represented as follows:

$$ \mathrm{Class}(x) = \arg\max_{c_i \in \mathrm{dom}(y)} \sum_{k} P_{M_k}(y = c_i \mid x), $$
(3)

where \( M_k \) denotes classifier k and \( P_{M_k}(y = c_i \mid x) \) denotes the probability, under classifier \( M_k \), that y takes the value \( c_i \) for an instance x [16, 17].
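
A hard-voting sketch of Eq. (3) over the three base classifiers used in this paper, via scikit-learn's VotingClassifier (the dataset and hyperparameters are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hard voting: each classifier M_k casts one vote and the class with the most
# votes wins, as in Eq. (3).
ensemble = VotingClassifier(estimators=[
    ("nb", GaussianNB()),
    ("bpnn", MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)),
    ("svm", SVC(kernel="rbf")),
], voting="hard")
print("Majority-voting accuracy:", cross_val_score(ensemble, X, y, cv=3).mean())
```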

Stacking is an ensemble method used to achieve the highest generalization accuracy. The reliability of the classifiers is judged by a meta-learner, which learns from the outputs of the base learners. Stacking uses the results of the base classifiers to produce a new data set to which a second learning algorithm is applied [4]. This method maximizes the use of the information contained in the training dataset. Normally, to form the meta-learner's training set, we divide the original training set into k disjoint subsets of equal size, a scheme known as k-fold cross-validation [4, 18]; k affects both the accuracy boost and the overall cost. The different base classifiers are trained and tested on different partitions of the training data. At the second level, the classifiers are trained again with the new class obtained from the first level, and the final accuracy is obtained. The results provided by this method were very good. The algorithm proceeds as follows (a code sketch is given after Fig. 2):

1. From the training set T, create k partitions and apply the cross-validation technique to all the base classifiers.

2. Use machine learning to obtain the second-level classifier.

3. Create a new class label, use the base classifiers again to test the data, and compute the accuracy (Fig. 2).

Fig. 2 Stacking technique
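
The steps above can be sketched with scikit-learn's StackingClassifier, which builds the meta-learner's training set from k-fold cross-validated predictions of the base classifiers (the logistic regression meta-learner and the iris dataset are illustrative assumptions, not the paper's exact configuration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Level-0 base classifiers; their cross-validated predictions form the new
# data set on which the level-1 (meta) classifier is trained (steps 1-2).
base = [
    ("nb", GaussianNB()),
    ("bpnn", MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)),
    ("svm", SVC(kernel="rbf")),
]
stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=3)  # cv=3 mirrors the threefold protocol used here
print("Stacking accuracy:", cross_val_score(stack, X, y, cv=3).mean())
```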

6 Results and Discussion

The set of experiments was carried out using the five datasets shown in Table 1: breast cancer, lung cancer, iris, E. coli, and yeast, all from the UCI repository.

Table 1 Different datasets used for experimental evaluation

The proposed model was tested with each individual classifier (SVM, BPNN, NB) and with the ensemble methods, stacking and majority voting, on all five benchmark data sets, as illustrated in Tables 2, 3 and 4. A threefold cross-validation test was carried out, and the accuracy was measured. The entire algorithm was written and tested in MATLAB R2010a (Figs. 3, 4, 5, 6 and 7).
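
As a rough Python analogue of this threefold protocol (the paper's own implementation is in MATLAB R2010a; the iris dataset here is one of the benchmarks above, and the hyperparameters are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
models = {
    "NB": GaussianNB(),
    "BPNN": MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0),
    "SVM": SVC(kernel="rbf"),
}
# Threefold cross-validation, matching the protocol of the experiments.
for name, model in models.items():
    print(name, "accuracy:", cross_val_score(model, X, y, cv=3).mean())
```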

Table 2 Accuracy of different datasets using different classifiers using random selection method
Table 3 Accuracy of different datasets using stacking ensemble method
Table 4 Accuracy of different datasets using majority voting ensemble method
Fig. 3 Accuracy of classifiers and ensemble methods on breast cancer dataset

Fig. 4 Accuracy of classifiers and ensemble methods on lung cancer dataset

Fig. 5 Accuracy of classifiers and ensemble methods on iris dataset

Fig. 6 Accuracy of classifiers and ensemble methods on E. coli dataset

Fig. 7 Accuracy of classifiers and ensemble methods on yeast dataset

7 Conclusion

A comparative study was carried out among the different classifiers and ensemble techniques, which were trained and tested on various publicly available gene expression datasets. The performance of the different classification methods and ensemble methods was compared using accuracy values. The ensemble methods applied to the gene expression data sets achieve higher accuracy than any of the individual classifiers, while also requiring less computational time and space.

Further, the accuracy of the ensemble technique could be enhanced by incorporating an optimization technique into the ensemble method.