1 Introduction

Classification is one of the main tasks in machine learning and data mining, and has been successfully applied to many areas such as computer science, engineering and biology. Moreover, classification has continuously received a great deal of attention. However, there are still open issues in classification, and one of these issues is classification with incomplete data [6, 9].

Incomplete data is data which contains some fields without values. Missing values are a common problem in many datasets. For example, 45% of the datasets in the UCI machine learning repository, one of the most popular collections of benchmark datasets for machine learning, contain missing values [6]. The reasons for missing values are various. For example, social survey sheets often contain missing values because respondents refuse to answer some questions; medical patient records also usually have missing values because not all tests can be performed on every patient [13].

Missing values cause serious problems for classification. One of the most serious is that the majority of classifiers cannot be applied to incomplete data: most classifiers require complete data, so they cannot work with incomplete data directly. Moreover, missing values often lead to large classification errors [6, 20].

One of the most popular approaches to classification with incomplete data is to use a classifier which can directly classify incomplete data. For example, C4.5 can directly deal with incomplete data in both the training and testing processes. Although this approach can tackle incomplete data to some extent, it often results in more complex learnt models and larger classification errors [19]. Therefore, further approaches to improving classifiers able to directly classify incomplete data should be investigated.

Feature selection is the process of selecting a suitable feature subset from the original features. Feature selection has been proven capable of improving classification accuracy and reducing the complexity of learnt models [14]. Although feature selection is mainly applied to complete data, it has also been used to improve classification with incomplete data [18, 22].

Ensemble learning is a machine learning method that builds a set of classifiers instead of a single classifier for a classification task. Ensemble methods have been demonstrated to enhance classification accuracy [4]. One of the most popular ensemble methods is bagging. Although bagging helps improve classification accuracy, it often generates more complex learnt models [17]. Moreover, bagging is mainly applied to complete data. Therefore, ways of improving bagging for classification with incomplete data should be investigated.

1.1 Research Goals

The goal of this paper is to propose a new method which improves bagging for classification with incomplete data. In order to achieve the goal, a combination of bagging and feature selection is proposed to classify incomplete data. The proposed method is compared with three benchmark methods for classification with incomplete data. The first benchmark method is to use a classifier which can directly classify incomplete data. The second benchmark method is to combine feature selection and a classifier which can directly classify incomplete data. The third benchmark method is to combine bagging and a classifier which can directly classify incomplete data. The experimental results are used to address the following objectives:

  1. Whether the combination of bagging and feature selection can achieve better classification accuracy compared to the other methods for classification with incomplete data.

  2. Whether the combination of bagging and feature selection can generate less complex learnt models compared to bagging with all features for classification with incomplete data.

1.2 Organisation

The rest of this paper is organised as follows. Section 2 outlines related work. Section 3 presents the proposed method. After that, Sect. 4 presents the comparison method and experiment design. Section 5 presents the results and analysis. Finally, Sect. 6 presents conclusions and future work.

2 Related Work

This section outlines related work including classification with incomplete data, feature selection and ensemble learning.

2.1 Classification with Incomplete Data

There are two main approaches to classification with incomplete data. One approach is to use imputation methods to transform incomplete data to complete data before using classification algorithms. The other approach is to use classification algorithms which can directly work with incomplete data without using imputation methods [6].

The purpose of imputation methods is to replace missing values with plausible values. For example, mean imputation fills all missing values in each feature with the average of all complete values in the same feature. The main benefit of imputation methods is that they provide complete data, which can then be used by any classification algorithm. However, simple imputation methods such as mean imputation often lead to large classification errors. Moreover, more sophisticated imputation methods are often computationally intensive, because the missing values must be estimated before a classification algorithm can be used [20].
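For illustration, the following is a minimal sketch of mean imputation with NumPy, assuming a numeric feature matrix in which missing values are encoded as NaN (the toy matrix is made up):

```python
import numpy as np

# Toy feature matrix: rows are instances, columns are features;
# missing values are encoded as NaN.
X = np.array([[1.0, 2.0, np.nan],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0]])

# Mean imputation: fill each missing entry with the average of the
# complete values in the same feature (column).
col_means = np.nanmean(X, axis=0)   # per-feature means, ignoring NaN
rows, cols = np.where(np.isnan(X))  # positions of the missing entries
X[rows, cols] = col_means[cols]     # replace each NaN with its column mean
```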

The majority of classification algorithms cannot directly work with incomplete data. However, some classification algorithms are able to directly classify incomplete data. For example, C4.5 [19] uses a probabilistic approach to tackle missing values in both the training set and the test set. The main benefit of such algorithms is that they do not require any time for estimating missing values. However, when these algorithms work with incomplete data, they often generate more complex models and lead to large classification errors [20]. Therefore, further approaches to improving these classification algorithms should be investigated.

2.2 Feature Selection

Feature selection is the process of selecting a relevant subset of features from the original features. The underlying reason for using feature selection is that data often contains redundant/irrelevant features, which can be removed without much loss of information. By removing redundant/irrelevant features, feature selection can help improve classification accuracy. Moreover, by providing a smaller number of features, feature selection can speed up the training process and produce simpler learnt classifiers which are easier to interpret [1, 14, 24].

Feature selection includes two main procedures: a search procedure and an evaluation procedure. The search procedure is used to search feature subsets while the evaluation procedure is used to measure the quality of feature subsets. The performance of feature selection strongly depends on the quality of both of the procedures [1, 14, 24].

Search methods for feature selection can be categorised into traditional search methods and evolutionary search methods. For example, sequential forward selection and sequential backward selection are two common traditional search methods for feature selection. In recent times, evolutionary algorithms have been successfully used as search methods in feature selection. Genetic algorithms and particle swarm optimisation (PSO) are two popular evolutionary search methods for feature selection [24].

Evaluation methods for feature selection can be categorised into wrapper methods and filter methods. A wrapper method uses a classifier to evaluate feature subsets, while a filter method uses a measure such as information gain. In order to evaluate each feature subset, wrapper methods need to train a classifier and then test its performance; therefore, they are often computationally expensive. In contrast, evaluation measures in filter methods are often computationally cheap; therefore, filter methods are often more efficient and general than wrapper methods. However, wrapper methods are often more accurate than filter methods [1, 14, 24].
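To make the distinction concrete, here is a small sketch of the two evaluation styles using scikit-learn; the dataset and the candidate subset are placeholders, and the specific classifier and measure are only examples:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
subset = [0, 2]  # a candidate feature subset (indices), chosen arbitrarily here

# Wrapper evaluation: train and test a classifier on the subset; the
# subset's quality is the classifier's cross-validated accuracy (expensive).
wrapper_score = cross_val_score(DecisionTreeClassifier(random_state=0),
                                X[:, subset], y, cv=5).mean()

# Filter evaluation: score the subset with a cheap statistical measure,
# here the mean mutual information between each feature and the class.
filter_score = mutual_info_classif(X[:, subset], y, random_state=0).mean()
```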

Particle swarm optimisation (PSO) is a swarm intelligence algorithm proposed by Kennedy and Eberhart in 1995 [10]. Recently, PSO has been widely used as a search method for feature selection. Continuous PSO is usually used, where the dimensionality of each particle equals the total number of features and a threshold \(\theta \) determines whether or not a feature is selected: if a particle's position value in a dimension is smaller than \(\theta \), the corresponding feature is not selected; otherwise, it is selected. PSO has been used in both wrapper and filter approaches. In a PSO-based wrapper, a classifier is required to evaluate particles, and the fitness of each particle is the accuracy of the classifier using the selected features. In a PSO-based filter, an evaluation measure is required to evaluate particles, and the fitness of each particle is estimated by the measure on the selected features [24].
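The following is a minimal sketch of PSO-based feature selection under these conventions. The fitness function is a stand-in (a real wrapper would return the accuracy of a classifier trained on the selected features), and details such as velocity clamping are omitted:

```python
import numpy as np

rng = np.random.default_rng(42)
n_features, n_particles, n_iters = 10, 30, 50
theta = 0.8                             # selection threshold (as in Sect. 4.4)
w, c1, c2 = 0.729844, 1.49618, 1.49618  # Clerc-Kennedy coefficients [3]

def decode(position):
    # A feature is selected iff its position value is at least theta.
    return position >= theta

def fitness(mask):
    # Stand-in fitness for illustration only: rewards matching a fictitious
    # set of relevant features. A wrapper would instead return the accuracy
    # of a classifier (e.g. C4.5) trained with the selected features.
    relevant = np.zeros(n_features, dtype=bool)
    relevant[:3] = True
    return (mask == relevant).mean()

pos = rng.random((n_particles, n_features))
vel = np.zeros_like(pos)
pbest = pos.copy()
pbest_fit = np.array([fitness(decode(p)) for p in pos])
gbest = pbest[pbest_fit.argmax()].copy()

for _ in range(n_iters):
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, 0.0, 1.0)
    fit = np.array([fitness(decode(p)) for p in pos])
    better = fit > pbest_fit
    pbest[better], pbest_fit[better] = pos[better], fit[better]
    gbest = pbest[pbest_fit.argmax()].copy()

selected = np.flatnonzero(decode(gbest))  # indices of the selected features
```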

Feature selection methods have mainly been applied to complete data. However, in recent times, some feature selection approaches for incomplete data have appeared. In [5], the mutual information criterion, which is based on k-nearest neighbours, is extended to handle missing values. The experimental results show that the method can select important feature subsets without using any imputation method and helps enhance the performance of the prediction models. In [18], a combination of the mutual information measure and rough sets is proposed to evaluate feature subsets with incomplete data. The empirical results show that the proposed method is effective for selecting feature subsets with incomplete data. A wrapper-based feature selection method for incomplete data is proposed in [22], where PSO is used as the search technique and C4.5, which is able to directly classify incomplete data, is used to evaluate feature subsets. The experimental results show that the proposed method not only improves the accuracy of the classifier, but also reduces its complexity.

2.3 Ensemble Learning

Ensemble learning is a machine learning method which constructs a set of classifiers for a classification task. It classifies a new instance by voting on the decisions of the individual classifiers. A set of classifiers has been proven capable of achieving higher accuracy than any of its individual members [4].

An ensemble of classifiers is accurate if the individual classifiers in the ensemble are both accurate and diverse. Two popular methods for constructing accurate ensembles are Bagging and Boosting. Both methods use “resampling” techniques to construct a different training set for each classifier. Bagging manipulates the original training data by randomly drawing instances with replacement. Consequently, some of the original instances might appear multiple times in the resulting training data while others might disappear. Bagging is usually helpful with “unstable” classification algorithms like neural networks and decision trees, where small changes in the training data often result in major changes in predictions. Experimental results reveal that a Bagging ensemble almost always achieves better accuracy than a single classifier. Boosting also manipulates the original training data by drawing with replacement, but it uses the performance of the previous classifier(s) to calculate the probability of selecting each instance. Boosting tries to construct new classifiers that are better able to classify instances for which the current ensemble’s performance is poor. Therefore, in Boosting, instances which are incorrectly classified by previous classifiers are selected more often than instances which are correctly classified. Experimental results reveal that with little or no classification noise, a Boosting ensemble also almost always achieves better accuracy than a single classifier, and it is sometimes more accurate than a Bagging ensemble. However, with substantial classification noise, a Boosting ensemble is often less accurate than a single classifier because it tends to overfit noisy datasets [16].
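The two resampling schemes differ only in the sampling distribution, as the following sketch shows (the instance weights are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100                  # number of training instances
idx = np.arange(n)

# Bagging: draw n instances uniformly at random, with replacement.
bag_sample = rng.choice(idx, size=n, replace=True)

# Boosting: also draw with replacement, but bias the draw toward instances
# the current ensemble misclassifies (these weights are made up here).
weights = np.ones(n)
weights[idx % 10 == 0] = 5.0  # pretend every tenth instance was misclassified
boost_sample = rng.choice(idx, size=n, replace=True, p=weights / weights.sum())
```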

Feature selection has also been used to improve ensemble learning. In [17], a genetic algorithm (GA) is used to search for a suitable set of feature subsets for an ensemble. Initially, a set of classifiers is generated where each classifier is built by randomly picking a set of features. After that, new classifiers are created by using the genetic operators. Finally, the fittest individuals are chosen to build an appropriate set of feature subsets, which is then used to create an ensemble. A neural network is used as the classifier, and the fitness of each individual is a combination of accuracy and diversity. Experimental results show that the feature selection ensemble can achieve better accuracy than the popular and powerful Bagging and Boosting ensembles. In [7], a GA is also applied to search for a set of feature subsets for an ensemble, where the GA runs multiple times with different training data to provide different feature subsets. C4.5 and Euclidean Decision Tables are used as classifiers, and the fitness of each individual is the classification accuracy. Experiments show that the feature selection ensemble is more accurate than other existing ensemble methods. In [15], a multi-objective GA is used to select a set of feature subsets which is used to construct a set of classifiers. After that, the multi-objective GA is used again to select an optimal set of classifiers. The experimental results show that the proposed method is more effective than Boosting and Bagging for the handwriting recognition problem.

Ensemble learning has also been used to solve classification with incomplete data. In [11], a set of classifiers is constructed to classify incomplete data, where each base classifier is trained with a random subset of features. In [2], incomplete data is first grouped into complete subsets, and then each subset is used to train one classifier. Although the two methods can tackle incomplete data to some extent, they cannot guarantee that all incomplete instances can be classified.

Popular ensemble methods such as bagging and boosting have mainly been applied to complete data. Therefore, the application of these ensemble methods to incomplete data should be investigated. Moreover, combining popular ensemble methods with feature selection has not been investigated, even though feature selection can improve the performance of classification with incomplete data. Therefore, a combination of popular ensemble methods with feature selection for incomplete data also needs to be investigated.

3 The Proposed Method

The key idea of the proposed method is to combine bagging with feature selection to improve the accuracy and diversity of the set of learnt classifiers. The underlying reason is that, to construct a set of classifiers, bagging repeatedly resamples the training dataset to build a set of training resampled datasets. The resampled datasets often contain redundant/irrelevant features, and feature selection has been proven capable of removing redundant/irrelevant features. Therefore, feature selection can be applied to each resampled dataset to eliminate redundant/irrelevant features. By eliminating these features, feature selection improves the resampled datasets, which in turn helps build more accurate and less complex learnt classifiers.

Fig. 1. The training process of classification with incomplete data by combining bagging and feature selection.

Figure 1 shows the main steps of the training process of the proposed method. In the training process, the training dataset is first put into a resampling procedure several times to generate a set of training resampled datasets. After that, each training resampled dataset is put into a feature selection procedure to select a suitable feature subset, which is then used to transform the training resampled dataset into a training selected dataset. Subsequently, each training selected dataset is used by a classification algorithm to learn a classifier. As a result, the training process generates a set of classifiers. In the application process, the set of classifiers is combined to classify each new instance.
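In outline, the training process might be sketched as follows. The feature_selection argument is a placeholder for the PSO-based procedure of Sect. 3.2, and scikit-learn's HistGradientBoostingClassifier stands in for C4.5/REPTree only because it accepts NaN inputs directly:

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier  # tolerates NaN inputs

def train_bagging_fs(X, y, n_classifiers=10, feature_selection=None, seed=0):
    """Sketch of the proposed training process: resample, select features,
    then learn one classifier per resampled-and-reduced dataset."""
    rng = np.random.default_rng(seed)
    n = len(y)
    ensemble = []
    for _ in range(n_classifiers):
        # Step 1: bootstrap-resample the training data.
        idx = rng.choice(n, size=n, replace=True)
        Xr, yr = X[idx], y[idx]
        # Step 2: select a feature subset for this resampled dataset
        # (in the paper: PSO search with wrapper evaluation).
        if feature_selection is not None:
            mask = feature_selection(Xr, yr)
        else:
            mask = np.ones(X.shape[1], dtype=bool)
        # Step 3: learn a classifier on the reduced, still-incomplete data.
        clf = HistGradientBoostingClassifier().fit(Xr[:, mask], yr)
        ensemble.append((clf, mask))
    return ensemble
```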

The main steps of the proposed method are presented in the following subsections.

3.1 Resampling Data

The purpose of each resampling of the training dataset is to create a random redistribution of it. Each training resampled dataset is generated by randomly choosing, with replacement, the same number of instances as in the original training dataset. As a result, many of the original instances may be repeated in the training resampled dataset while others may be left out.

3.2 Feature Selection

The key difference between the proposed method and bagging lies in this step. Bagging immediately uses the set of training resampled datasets to build a set of classifiers. In contrast, the proposed method applies feature selection to eliminate redundant/irrelevant features in each training resampled dataset before building the set of classifiers.

In order to remove redundant/irrelevant features from incomplete data, any search technique can be used to find feature subsets. However, to evaluate a feature subset which may contain incomplete features, the feature selection procedure requires a feature subset evaluation method which can deal with incomplete data. In [22, 23], a combination of PSO and a classifier able to classify incomplete data has been successfully used to remove redundant/irrelevant features from incomplete data. Therefore, in the proposed method, PSO is used to search for feature subsets, and a classifier able to classify incomplete data, such as C4.5, is used to evaluate them.
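A sketch of such a wrapper evaluation on incomplete data follows; HistGradientBoostingClassifier is again used only as a NaN-tolerant stand-in for C4.5, and this function is what the PSO search would call as its fitness:

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def wrapper_fitness(mask, X, y):
    """Fitness of a candidate feature subset on incomplete data: the
    cross-validated accuracy of a NaN-tolerant classifier trained on the
    selected features only, with no imputation step."""
    if not mask.any():
        return 0.0  # an empty subset cannot classify anything
    clf = HistGradientBoostingClassifier()
    return cross_val_score(clf, X[:, mask], y, cv=3, scoring="accuracy").mean()
```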

3.3 Combining Classifiers

The set of classifiers built in the training process is combined to classify new instances in the application process. Majority voting chooses the class label with the most votes from the ensemble members as the ensemble output. Majority voting is a simple and powerful combination method [16]; therefore, it is used to combine the classifiers in the proposed method.
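Assuming each ensemble member is stored together with its own feature mask (as in the training sketch above), majority voting could look like this for integer-encoded class labels:

```python
import numpy as np

def predict_majority(ensemble, X):
    """Classify instances by majority vote: each member predicts using its
    own selected features, and the most frequent label wins per instance."""
    votes = np.stack([clf.predict(X[:, mask]) for clf, mask in ensemble])
    # votes has shape (n_classifiers, n_instances); count labels per column.
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```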

4 Method and Experiment Design

This section presents the comparison method and the experiment design, including the datasets used in the experiments and the parameter settings for feature selection and the classification algorithms.

4.1 The Comparison Method

Experiments are conducted to evaluate the effectiveness of the combination of bagging and feature selection for classification with incomplete data. To this end, the combination of bagging and feature selection shown in Fig. 1 is compared with three other common methods for classification with incomplete data, shown in Figs. 2, 3 and 4. Figure 2 shows the training process when directly using a classifier able to classify incomplete data. Figure 3 shows the training process when combining feature selection with a classifier able to directly classify incomplete data. Figure 4 shows the training process when combining bagging with a classifier able to directly classify incomplete data.

Fig. 2. The training process of classification with incomplete data by directly using a classifier able to classify incomplete data.

Fig. 3. The training process of classification with incomplete data by using feature selection.

Fig. 4. The training process of classification with incomplete data by using bagging.

In the four setups, the incomplete dataset is first divided into a training dataset and a testing dataset. In the proposed setup shown in Fig. 1, the training dataset is used by bagging and feature selection to build a set of classifiers, which is then used to classify the testing dataset. In the setup shown in Fig. 2, the training dataset is directly put into a classification algorithm to learn a classifier, which is then used to classify the testing dataset. In the setup shown in Fig. 3, the training dataset is put into a feature selection procedure to select a suitable feature subset, which is used to transform the training dataset into a training selected dataset. After that, the training selected dataset is used by a classification algorithm to learn a classifier, which is used to classify the testing dataset. In the setup shown in Fig. 4, the training dataset is put into a resampling procedure to generate a set of training resampled datasets. After that, the set of training resampled datasets is used by a classification algorithm to build a set of classifiers, which is then used to classify the testing dataset.

4.2 Datasets

The experiments used eight benchmark incomplete datasets chosen from the UCI Repository of Machine Learning Databases [12]. Table 1 presents the main characteristics of these datasets: the name, the number of features (R: real/I: integer/N: nominal values), the number of classes, the number of instances and the percentage of incomplete instances (instances containing at least one missing field).

Table 1. Datasets used in the experiments.

These datasets were carefully chosen to represent classification tasks with incomplete data of varying difficulty, dimensionality, feature types and numbers of classes. The datasets have varying levels of incomplete instances (Cleveland has 1.98% incomplete instances while Hepatitis has 48.39%). They also range from low to high dimensionality (Mammographic has five features while Dermatology has 34), and cover different feature types as well as both binary and multi-class classification.

None of the datasets has a specific test set. Furthermore, the number of instances in some datasets is relatively small. Consequently, ten-fold cross-validation was used to measure the performance of the learnt classifiers. In the experiments, ten-fold cross-validation was performed 30 times on each dataset. Therefore, for each dataset, 300 pairs of training and test sets were generated to evaluate the performance of the algorithms.
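This protocol corresponds to repeated ten-fold cross-validation; a sketch with scikit-learn on toy data (stratification is an assumption, not stated in the paper):

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

rng = np.random.default_rng(0)
X = rng.random((150, 5))
y = rng.integers(0, 2, size=150)

# Ten-fold cross-validation repeated 30 times: 10 x 30 = 300 train/test pairs.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=30, random_state=1)
assert sum(1 for _ in cv.split(X, y)) == 300
```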

4.3 Classification Algorithms

In the experiments, we used C4.5 [19] and REPTree [21] to classify data and to evaluate the quality of feature subsets in the feature selection procedure. Both algorithms are able to directly classify incomplete data. WEKA [8] was used to implement the algorithms, with its parameters set to the default values. Following [16], in the proposed method and the original bagging method, the number of classifiers was set to 10.

4.4 PSO Parameter Settings

The experiments used continuous PSO to search for feature subsets in the feature selection procedure. The PSO parameters were set to the common settings proposed by Clerc and Kennedy [3]. The detailed settings are as follows: the inertia weight (\(\omega \)) was set to 0.729844; the cognitive parameter (\(c_{1}\)) and the social parameter (\(c_{2}\)) were both set to 1.49618; the population size was set to 30; the maximum number of iterations was set to 50; and the fully connected topology was used. Classification accuracy was used to evaluate the quality of particles. The threshold \(\theta \) was set to 0.8 to determine whether or not a feature is selected.

5 Results and Analysis

This section compares the proposed method with the other methods in terms of classification accuracy and the complexity of the learnt classifiers. It also discusses some reasons for the experimental results.

5.1 Classification Accuracy

Table 2 shows the average classification accuracy and standard deviation of the four methods with the two classifiers on the eight incomplete datasets. The average classification accuracy is the average over 30 runs of ten-fold cross-validation on each dataset. In the table, the BaFS column presents the accuracy of the proposed setup shown in Fig. 1; the All column presents the accuracy of the setup shown in Fig. 2; the FS column presents the accuracy of the setup shown in Fig. 3; and the BaAll column presents the accuracy of the setup shown in Fig. 4.

To compare the performance of the proposed method with the other methods, the Wilcoxon signed-rank test at the 95% confidence level is used to compare the classification accuracy achieved by BaFS with that of each other method. The “T” columns in Table 2 show the significance test of the columns before them against BaFS, where “+”, “=” and “−” mean that BaFS is significantly more accurate, not significantly different, and significantly less accurate, respectively. Bold entries mark the best result for each dataset.
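For reference, the test itself is straightforward to reproduce with SciPy; the paired accuracy vectors below are invented for illustration:

```python
import numpy as np
from scipy.stats import wilcoxon

# Paired accuracies of BaFS and a competing method over repeated CV runs
# (these numbers are made up for illustration).
bafs  = np.array([0.84, 0.86, 0.85, 0.87, 0.83, 0.86, 0.85, 0.88, 0.84, 0.86])
other = np.array([0.81, 0.84, 0.83, 0.85, 0.82, 0.83, 0.84, 0.85, 0.82, 0.84])

stat, p = wilcoxon(bafs, other)
print(f"p = {p:.4f}; significant at the 95% level: {p < 0.05}")
```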

Table 2. The classification accuracy of different methods.

It is clear from Table 2 that in most cases the proposed method achieves the best classification accuracy. In all cases, the proposed method achieves significantly better classification accuracy than using all features (Fig. 2). In all cases, the proposed method also achieves significantly better classification accuracy than using selected features (Fig. 3). Compared to bagging (Fig. 4), out of 16 cases, the proposed method achieves significantly better classification accuracy in 10 cases and similar accuracy in 6 cases, and it is never significantly less accurate than bagging.

In summary, the combination of bagging and feature selection significantly improves the accuracy of classification with incomplete data.

5.2 Classifier Size

Table 3 shows the average size of the decision trees (the number of nodes in the trees) for the four methods with the two classifiers on the eight incomplete datasets. The average tree size is the average over 30 runs of ten-fold cross-validation on each dataset. In the table, the BaFS column presents the average size for the proposed setup shown in Fig. 1; the All column for the setup shown in Fig. 2; the FS column for the setup shown in Fig. 3; and the BaAll column for the setup shown in Fig. 4.

Table 3. The tree size of different methods.
Fig. 5. Tree size ratio between the other methods and BaFS.

Figure 5 summarises Table 3 by showing the average tree size ratio between each of the other methods and the proposed method with C4.5 and REPTree (a ratio bigger than one means a bigger tree; otherwise an equal or smaller tree). It can be seen from Fig. 5 that feature selection on the original data (Fig. 3) produces the smallest trees. Moreover, it is clear from Fig. 5 that with both C4.5 and REPTree, the average tree size generated by bagging with all features is bigger than that of the combination of bagging and feature selection. In other words, the combination of bagging and feature selection reduces the complexity of the learnt classifiers produced by bagging.

In summary, the combination of bagging and feature selection not only helps improve the classification accuracy of bagging, but also reduces the complexity of the learnt classifiers compared with the standard bagging method.

5.3 Further Analysis

To understand how the combination of bagging and feature selection can achieve better classification accuracy and smaller trees than bagging with all features, we looked carefully at the trees generated by C4.5 bagging with all features and with selected features on the Hepatitis dataset, which has 19 features (Age, Sex, Steroid, Antivirals, Fatigue, Malaise, Anorexia, LiverBig, LiverFirm, SpleenPalpable, Spiders, Ascites, Varices, Bilirubin, AlkPhosphate, Sgot, AlbuMin, ProTime, Histology). The Hepatitis dataset was chosen because the trees generated on it are not too big to analyse. Figures 6 and 7 present two typical patterns of trees we observed.

Fig. 6. Left: tree with 90.0% accuracy generated by C4.5 bagging with all features. Right: tree with 92.14% accuracy generated by C4.5 bagging with selected features.

Fig. 7. Left: tree with 86.42% accuracy generated by C4.5 bagging with all features. Right: tree with 91.42% accuracy generated by C4.5 bagging with selected features.

It is clear from Figs. 6 and 7 that the combination of bagging and feature selection can generate more accurate and less complex trees than bagging with all features. The reason might be that classifiers like C4.5 are greedy algorithms which make the locally optimal choice at each stage; therefore, they may produce only locally optimal solutions. The purpose of feature selection is to search for more suitable feature subsets, so feature selection can help reduce this limitation of greedy algorithms. For example, in Fig. 6, with one training resampled dataset, when C4.5 bagging uses all features, the information gain of the features Spiders and Sex is higher than that of the feature Age, so Spiders and Sex are chosen to build the left tree before Age. When feature selection is applied to the training resampled dataset, only three features, Age, Ascites and LiverBig, are selected. Consequently, Age is chosen to develop the right tree instead of Spiders or Sex. As a result, bagging with selected features generates more accurate and less complex trees than bagging with all features.

It can also be seen from Figs. 6 and 7 that the combination of bagging and feature selection can generate more diverse trees than bagging with all features. For example, the right tree in Fig. 6 uses the same feature, Ascites, at the first level as the right tree in Fig. 7; however, the left tree in Fig. 6 uses a different feature at the first level from the right tree in Fig. 7. By generating more diverse trees, feature selection helps improve the bagging method.

In summary, bagging with selected features can generate more accurate, less complex and more diverse learnt models than bagging with all features. Therefore, the combination of bagging and feature selection helps improve the traditional bagging method.

6 Conclusions and Future Work

This paper proposed a combination of bagging and feature selection to improve classification with incomplete data. To this end, bagging is first used to construct a set of training resampled datasets. After that, the set of training resampled datasets is used by a wrapper-based feature selection procedure to build a set of training selected datasets, which is then used to learn a set of classifiers. The proposed method was compared with three other popular classification methods which can directly work with incomplete data. The experiments on eight incomplete datasets used C4.5 and REPTree, which are able to directly classify incomplete data, as classifiers and PSO as the search technique in feature selection. The results showed that the combination of bagging and feature selection is more accurate than the other methods. Moreover, it reduces the complexity of the learnt models generated by the bagging method.

Another of the most popular ensemble methods is boosting. We have already tried to use the same process as the proposed method to combine boosting and feature selection for classification with incomplete data. However, the experimental results were not promising. Therefore, future work could investigate more suitable approaches to combining boosting and feature selection, and examine which ensemble methods are more suitable for which kinds of problems.