
1 Introduction

When solving a problem, humans usually ignore irrelevant details and focus on the major parts of the problem, which simplifies problem solving; examples include feature selection [1] and attribute reduction [2] in knowledge mining. In addition, analyzing a problem from several different aspects and then combining the results is another common strategy in human problem solving, and there is much related research, such as subspace methods [3], multiple-view learning [4], and so on.

These two strategies have been widely used in classification problems [5, 6]. First, ignoring irrelevant information can improve execution efficiency, and studies have shown that irrelevant attributes have a negative effect on classification accuracy. Second, classifying from several different views and then combining the results is another effective way to improve classification accuracy. However, most existing research deals with complete data, whereas in many practical applications missing values are often inevitable for various reasons, such as equipment errors, data loss, and manual data entry. Classification on incomplete data is therefore very necessary.

To avoid the negative impact of irrelevant attributes on classification performance, we propose a multiple-side multiple-learner algorithm (MSML) for incomplete data. MSML first uses a chi-square statistic evaluation algorithm to delete unimportant attributes, and then constructs a group of classifiers according to the missing feature values in the selected feature subset. Finally, the results of the different classifiers are combined by weighted majority voting.

The rest of the paper is organized as follows. Research on incomplete data classification is briefly reviewed in Sect. 2. In Sect. 3, we introduce the MSML algorithm. Section 4 gives numerical experiments on 8 real incomplete datasets from the UCI Machine Learning Repository. Section 5 concludes the paper.

2 Related Work

Several studies have addressed classification on incomplete data. There are two simple ways to deal with incomplete datasets. One is to simply ignore the samples with missing values; however, this may discard potentially useful information and leave an insufficient number of samples for investigation [7]. Imputation, which replaces each missing value with a particular value of the corresponding variable, is another common solution. Both methods are known to incur potential estimation bias [8, 9]. One family of methods that can avoid this estimation bias uses the EM algorithm [10], gradient descent [11], Gibbs sampling [12], or logistic regression [13]. However, these methods rely on the assumption that data are missing at random, there is no technique to verify this assumption, and their accuracy decreases dramatically when the assumption is violated.

To avoid the missing-at-random assumption, Ramoni and Sebastiani proposed the Robust Bayes Classifier (RBC) [14], which needs no assumption about the missing-data mechanism. However, like the Naive Bayes classifier, RBC assumes that attributes are independent given each class. Krause et al. [15] introduced an ensemble method for incomplete data in which sub-classifiers are trained on random feature subsets; the method also assumes that the value of any feature is independent of all others. Chen et al. [16] put forward a noninvasive neural network ensemble (NNNE) method without any assumptions about the data distribution. This method generates a community of base classifiers trained only on known values, but it does not take into account differences in attribute importance degree. To overcome this limitation, a multi-granulation ensemble method (MGNNE) was proposed [17], in which information entropy is applied to measure attribute importance degree. However, the performance of MGNNE relies on the proportion of samples with no missing values. Moreover, none of these three methods considers the negative effect of irrelevant attributes on classification performance.

Considering the characteristics of incomplete datasets and the negative effect of irrelevant attributes on classification performance, we propose a new algorithm, the multiple-side multiple-learner classification algorithm (MSML), to deal with incomplete data.

3 MSML

3.1 Chi-Square Statistic Feature Evaluation Algorithm

We apply the chi-square statistic to measure the importance degree between each attribute and the class variable. A feature subset is selected by removing the attributes whose cumulative distribution function (cdf) values are smaller than a threshold \(\alpha \). We first give the method for constructing the contingency table of an attribute variable with respect to the class variable.

Given an incomplete dataset D, suppose A is an attribute of D with m values and d is the class variable with l values. Note that we use '?' to denote a missing (unknown) value. The process of constructing the contingency table \(M_{A}\) of attribute A with respect to d can be described as follows (a code sketch is given after the steps):

  • (1) Count the following frequencies:

    \(f_{ij} = f(A = a_i ,d = d_j ),f_{(m + 1)j} = f(A = ?,d = d_j ),\)

    \(f_{i(l + 1)}=f(A = a_i ,d =?)\) and \(f_{(m + 1)(l + 1)}=f(A = ?,d = ?).\)

  • (2) Allocate \(f_{(m + 1)j}\), \(f_{i(l + 1)}\) and \(f_{(m + 1)(l + 1)}\) to \(f_{ij}\).

    To update \(f_{ij}\):

    • (2.1) Compute the following summation:

      \(row_i = \sum \limits _{j = 1}^{l} f_{ij},\quad col_j = \sum \limits _{i = 1}^{m} f_{ij},\quad N = \sum \limits _{i = 1}^{m} \sum \limits _{j = 1}^{l} f_{ij}\)

    • (2.2) Update \(f_{ij}\):

      \(f_{ij}^{'} \leftarrow f_{ij} + f_{i(l + 1)} \times \frac{{col_j }}{N} + f_{(m + 1)j} \times \frac{{row_i }}{N} + f_{(m + 1)(l + 1)} \times \frac{{f_{ij} }}{N}\)

  • (3) Obtain the contingency table \(M\), where \(M_{ij}=f^{'}_{ij}\).
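The steps above can be summarized in the following minimal sketch. It assumes the attribute and class values are given as plain Python lists with '?' marking missing entries; the function name and interface are chosen here only for illustration.

```python
import numpy as np

def contingency_table(a_col, d_col, missing="?"):
    """Contingency table of attribute A with respect to class d (Sect. 3.1).

    a_col, d_col: per-sample values of the attribute and the class;
    '?' denotes a missing value. Returns the m x l table of f'_{ij}.
    """
    a_vals = sorted({v for v in a_col if v != missing})
    d_vals = sorted({v for v in d_col if v != missing})
    m, l = len(a_vals), len(d_vals)
    a_idx = {v: i for i, v in enumerate(a_vals)}
    d_idx = {v: j for j, v in enumerate(d_vals)}

    # Step (1): raw frequencies, with an extra row/column for '?' entries.
    f = np.zeros((m + 1, l + 1))
    for a, d in zip(a_col, d_col):
        f[a_idx.get(a, m), d_idx.get(d, l)] += 1

    # Step (2): allocate the '?' counts onto the known cells.
    known = f[:m, :l]
    row = known.sum(axis=1)      # row_i
    col = known.sum(axis=0)      # col_j
    N = known.sum()
    # f'_{ij} = f_{ij} + f_{i(l+1)} * col_j/N + f_{(m+1)j} * row_i/N
    #           + f_{(m+1)(l+1)} * f_{ij}/N
    return (known
            + f[:m, l][:, None] * (col[None, :] / N)
            + f[m, :l][None, :] * (row[:, None] / N)
            + f[m, l] * known / N)
```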

According to the above steps, we can obtain the contingency table between each attribute and the class variable. Given a contingency table M of attribute A with respect to d, we use the chi-square statistic to measure the importance degree of attribute A. The chi-square attribute evaluation algorithm for an incomplete dataset is as follows:

  • (1) Construct the contingency table \(M\) (\(m \times n\)) of each attribute with respect to the class variable, where m and n are the numbers of distinct values (excluding '?') of attribute A and the class variable d, respectively;

  • (2) For a contingency table M of attribute A with respect to d, compute the chi-square statistic Chi(A, d);

    • (2.1) Compute the summation of each row of \(M_A\) denoted by \(r_{i}\) and each column of \(M_A\) denoted by \(c_{j}\), respectively:

      \(r_i = \sum \limits _{j = 1}^{n} v_{ij}\; (i=1,2,\ldots,m),\quad c_j = \sum \limits _{i = 1}^{m} v_{ij}\; (j=1,2,\ldots,n);\)

    • (2.2) For each pair (i, j), calculate the expected frequency \(E_{ij}\):

      \(E_{ij} = \frac{r_i \, c_j}{N}\;(i=1,2,\ldots,m;\, j=1,2,\ldots,n;\, N = \sum \limits _{i = 1}^{m} \sum \limits _{j = 1}^{n} v_{ij})\);

    • (2.3) Compute the chi-square statistic value \(Chi(A,d) = \sum \limits _{i = 1}^m {\sum \limits _{j = 1}^n {\frac{{(E_{ij} - v_{ij} )^2 }}{{E_{ij} }}} }\);

    • (2.4) Compute the cdf value \(P_{A}\) corresponding to Chi(A, d).

  • (3) Select the feature subset \(S_1\) consisting of the attributes whose cdf values are larger than the threshold \(\alpha (0 < \alpha < 1)\).

According to the above method, we obtain a feature subset \(S_1\) of the incomplete dataset, consisting of the attributes whose cdf values are larger than the given threshold \(\alpha \); a code sketch of this evaluation is given below. In general, \(S_1\) still contains missing values, and we will construct a group of classifiers on \(S_1\).
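A minimal sketch of this evaluation follows. It uses SciPy's chi-square cdf, assumes the usual \((m-1)(n-1)\) degrees of freedom (which the text above leaves implicit), and takes a dictionary mapping attribute names to their contingency tables, a representation chosen here for illustration.

```python
import numpy as np
from scipy.stats import chi2

def chi_square_cdf(M):
    """Chi-square statistic Chi(A, d) of contingency table M and its cdf value."""
    m, n = M.shape
    r = M.sum(axis=1)                      # row sums r_i
    c = M.sum(axis=0)                      # column sums c_j
    N = M.sum()
    E = np.outer(r, c) / N                 # expected frequencies E_ij
    chi_stat = ((E - M) ** 2 / E).sum()    # Chi(A, d)
    dof = (m - 1) * (n - 1)                # assumed degrees of freedom
    return chi_stat, chi2.cdf(chi_stat, dof)

def select_attributes(tables, alpha=0.9):
    """Keep the attributes whose cdf value exceeds the threshold alpha (S_1)."""
    return [name for name, M in tables.items() if chi_square_cdf(M)[1] > alpha]
```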

3.2 Multiple-Side Multiple-Learner for Incomplete Data

Let \(D=\{(x_{i},y_{i})\left| i=1,2,...,n\right. \}\) be the incomplete dataset, where n denotes the size of the dataset. Suppose the input space has d features, \(X = (X^{(1)} ,X^{(2)} ,...,X^{(d)} )\). If a value \(x_i^{(j)}\) of sample \(x_i\) is unknown, it is denoted as \(x_i^{(j)}=null\). For convenience, we first give some definitions as follows:

Definition 1:

For a sample \(x_i\) of D, the missing value set of \(x_i\) is defined as the feature subset \(mset\{i\}\) such that \(x_i\) is missing on all features in \(mset\{i\}\) and complete on all features of X not in \(mset\{i\}\).

\(mset\{i\} = \{ X^{(j)} \mid x_i^{(j)} = null \}\), so that \(x_i^{(j)} \ne null\) for every \(X^{(j)} \notin mset\{i\}\).

Definition 2:

The missing attribute set (MS) of D is defined as the set of all distinct missing value sets of its samples, \(MS=\{mset_1,\ldots,mset_k\}\), in which each missing value set is unique.

Definition 3:

A complete data subset \(X_{mset_R}\) is defined as the data subset corresponding to a missing value set \(mset_R \in MS\):

\(X_{mset_R}=\{x_i^{(j)} \mid x_i \in D \wedge \forall X^{(j)} \notin mset_R:\; x_i^{(j)} \ne null\}\)

Note that each complete data subset corresponds to a unique feature subset (or missing value set) of the incomplete dataset.
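The three definitions can be illustrated by the following sketch, which assumes the incomplete dataset is stored as a numeric array with NaN marking missing values (a representation chosen here for illustration):

```python
import numpy as np

def missing_value_sets(X):
    """Definitions 1 and 2: per-sample missing value sets and the set MS."""
    msets = [frozenset(np.flatnonzero(np.isnan(x))) for x in X]   # mset{i}
    MS = sorted(set(msets), key=len)                              # distinct sets
    return msets, MS

def complete_subset(X, mset_R):
    """Definition 3: samples complete on every feature outside mset_R,
    restricted to those features."""
    keep = [j for j in range(X.shape[1]) if j not in mset_R]
    rows = [i for i, x in enumerate(X) if not np.isnan(x[keep]).any()]
    return X[np.ix_(rows, keep)], rows, keep
```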

To improve performance, each complete data subset is used to train a classifier with the bagging algorithm. For a test sample, the algorithm chooses the classifiers that do not require the sample's missing values, and weighted majority voting is then applied to combine their predictions.
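Selecting the applicable sub-classifiers for a test sample can be sketched as follows, assuming each trained sub-classifier is stored together with the feature indices it uses (a representation chosen here for illustration):

```python
import numpy as np

def applicable_classifiers(x_test, ensemble):
    """Sub-classifiers that do not need any missing value of x_test.

    ensemble: list of (keep, model) pairs, where keep holds the feature
    indices the sub-classifier was trained on (assumed layout).
    """
    missing = set(np.flatnonzero(np.isnan(x_test)))
    return [(keep, model) for keep, model in ensemble
            if missing.isdisjoint(keep)]
```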

To determine the final prediction for a test sample, several factors are considered in the weighted majority voting. First, each complete data subset has a unique feature subset, whose relevance degree for predicting the class label differs. Moreover, the sub-classifiers trained on the complete data subsets have different testing accuracies, and it is commonly agreed that higher testing accuracy indicates better prediction. In addition, the sizes of the complete data subsets differ, which also needs to be considered. Combining these three factors, each available sub-classifier is assigned a weight as follows:

$$\begin{aligned} w_i = \frac{{MI_i \left| X_{mset_i }\right| ACC_i }}{{\sum _j {MI_j \left| X_{mset_j } \right| ACC_j } }} \end{aligned}$$
(1)

Here \(\left| X_{mset_i}\right| \) denotes the size of the complete data subset \(X_{mset_i}\), \(MI_{i}\) denotes the relevance degree (measured by mutual information) between the attribute set and the class variable on data subset \(X_{mset_{i}}\), and \(ACC_i\) denotes the testing accuracy of the \(i\)-th sub-classifier. Algorithm 1 gives the MSML algorithm.

Algorithm 1. MSML
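A sketch of Eq. (1) and the weighted majority vote is given below; the dictionary layout and the scikit-learn-style predict interface are assumptions made here for illustration.

```python
import numpy as np

def msml_vote(x_test, subs):
    """Weighted majority voting over the applicable sub-classifiers.

    subs: list of dicts with keys 'model', 'cols' (feature indices),
    'mi' (MI_i), 'size' (|X_{mset_i}|) and 'acc' (ACC_i) -- assumed layout.
    """
    raw = np.array([s["mi"] * s["size"] * s["acc"] for s in subs])
    w = raw / raw.sum()                                   # Eq. (1): weights w_i
    votes = {}
    for s, wi in zip(subs, w):
        label = s["model"].predict(x_test[s["cols"]].reshape(1, -1))[0]
        votes[label] = votes.get(label, 0.0) + wi
    return max(votes, key=votes.get)                      # final prediction
```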

4 Experiments

4.1 Experimental Description

To verify the validity of MSML, we carried out experiments on 8 benchmark datasets with missing data from the UCI machine learning repository [18]. All our experiments were implemented in MatlabR2001a. The implementation was run on an Intel Core i5 CPU at 3.2 GHz (4 CPUs) with 4 GB RAM. Table 1 gives detailed information about the datasets used in the experiments.

For MSML, MGNNE, and NNNE, a fast BP algorithm, the Levenberg-Marquardt algorithm, which has an efficient implementation provided by Matlab, is used in our experiments. The number of input nodes (id) is determined by the number of available attributes of each data subset, and the number of output nodes (od) is determined by the number of classes. According to the geometric pyramid rule, the number of hidden nodes is \(\sqrt{id*od}\). We evaluate accuracy using ten-fold cross-validation, in which a given dataset is randomly partitioned into ten folds of equal size. For each complete data subset, we apply the bagging algorithm to improve performance and set the number of replicates to 10 [19].
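For reference, the geometric pyramid rule amounts to the following small computation (rounding to the nearest integer is an assumption, as the text leaves it implicit):

```python
import math

def hidden_nodes(id_nodes: int, od_nodes: int) -> int:
    """Number of hidden nodes by the geometric pyramid rule: sqrt(id * od)."""
    return max(1, round(math.sqrt(id_nodes * od_nodes)))
```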

Table 1. Summarization of datasets characteristics

4.2 Experimental Results and Analysis

In our algorithm, the attributes with cdf values smaller than the threshold \(\alpha \) are deleted to avoid the adverse effect of irrelevant attributes on algorithm performance. We choose two datasets, Bands and L.cancer, to study the relationship between algorithm performance and the threshold. We vary the threshold from 0.50 to 0.95 with an interval of 0.05. Tables 2 and 3 report the results.

Table 2. Performance of MSML on Bands with the change of \(\alpha \)

One can see that, as \(\alpha \) increases, both the number of selected attributes and the algorithm runtime decrease gradually. As \(\alpha \) increases from 0.5 to 0.9, the accuracy is basically unchanged, while at \(\alpha =0.95\) the performance on both datasets declines noticeably (Bands: about 2 %, L.cancer: about 46 %). Based on these results, we choose \(\alpha =0.9\) as the threshold for deleting unimportant attributes and thereby improving algorithm efficiency.

Table 3. Performance of MSML on L.cancer with the change of \(\alpha \)

Table 4 gives the accuracy comparison of our algorithm with MGNNE, NNNE, and RBC on the 8 datasets. We can see that, overall, NNNE performs relatively poorly. RBC has the best accuracy on two datasets, B.cancer and Heart-h. MSML performs best on 5 datasets, and MGNNE is slightly more accurate than MSML on the Vote dataset, which indicates that a small number of relevant attributes were removed from Vote when \(\alpha =0.9\); one effective remedy is to retain more attributes by setting a smaller threshold. On the four datasets Automobile, Bands, Credit, and L.cancer, MSML improves the accuracy over MGNNE by about \(1\,\% \sim 2\,\%\), which suggests that irrelevant attributes have an adverse impact on accuracy.

Table 4. The average accuracy of the four classifiers

By deleting irrelevant attributes, MSML greatly improves execution efficiency compared with the neural-network-ensemble-based algorithms. Table 5 shows the details of the three algorithms MSML, MGNNE, and NNNE on the 8 datasets. Note that the difference between MGNNE and NNNE is that MGNNE modifies the weighted majority voting of NNNE by applying information entropy to measure the importance degree of each sub-classifier, so the number of attributes and the number of data subsets of both methods are equal. Thus, we only list the runtime of MGNNE.

We can see that quite a few irrelevant attributes were deleted on the three datasets Bands, Credit, and Heart-h, so the number of complete data subsets decreased considerably and the computational time was greatly reduced. At the same time, the runtime of MSML is higher than that of MGNNE and NNNE on the Automobile and Mushroom datasets, because only one attribute was removed from each of these datasets and the number of data subsets is unchanged, while the chi-square attribute evaluation introduces a certain amount of additional execution time. For the L.cancer dataset, the runtime declines markedly because the number of attributes is reduced from 56 to 17; moreover, MSML retains only one data subset, which means that all attributes with missing values are deleted. Overall, by removing irrelevant attributes, MSML effectively improves execution efficiency while maintaining accuracy.

Table 5. Runtime, number of selected attributes and number of data subsets

5 Conclusion and Discussion

Removing the irrelevant attributes of a dataset and then building an ensemble classifier on the selected attribute set is an effective way to improve accuracy and execution efficiency. Most current studies require complete data; however, real datasets are often incomplete for various reasons, so building classifiers that can deal with incomplete data is meaningful.

This paper puts forward a multiple-side multiple-learner classification algorithm for incomplete data based on the characteristics of incomplete datasets. MSML first constructs the contingency table of each attribute with respect to the class variable, and then applies the chi-square statistic evaluation algorithm to select a feature subset by removing irrelevant attributes. A group of sub-classifiers is then built on the complete data subsets and combined by weighted majority voting. Experiments show that MSML is an effective classification method for incomplete datasets.