Abstract
Missing data is an issue in many real-world datasets yet robust methods for dealing with missing data appropriately still need development. In this paper we conduct an investigation of how some methods for handling missing data perform when the uncertainty increases. Using benchmark datasets from the UCI Machine Learning repository we generate datasets for our experimentation with increasing amounts of data Missing Completely At Random (MCAR) both at the attribute level and at the record level. We then apply four classification algorithms: C4.5, Random Forest, Naïve Bayes and Support Vector Machines (SVMs). We measure the performance of each classifiers on the basis of complete case analysis, simple imputation and then we study the performance of the algorithms that can handle missing data. We find that complete case analysis has a detrimental effect because it renders many datasets infeasible when missing data increases, particularly for high dimensional data. We find that increasing missing data does have a negative effect on the performance of all the algorithms tested but the different algorithms tested either using preprocessing in the form of simple imputation or handling the missing data do not show a significant difference in performance.
Access provided by CONRICYT-eBooks. Download conference paper PDF
Similar content being viewed by others
Keywords
1 Introduction
Many real-world datasets have missing or incomplete data [24]. Since the accuracy of most machine learning algorithms for classification, regression, and clustering is affected by the completeness of datasets, processing and dealing with missing data is a significant step in the Knowledge Discovery and Data Mining (KDD) process. Some strategies have been devised to handle incomplete data as explained in [5, 8, 14]. In particular, for regression, where missing data has been more widely studied (e.g. [9]), multiple imputation has shown advantage over other methods [22, 23]. However, much work is still needed to solve this problem in the context of data mining tasks and multiple imputation in particular needs some research to show if it is equally applicable to data mining.
Before we investigate multiple imputation and data mining, which is our long term aim, in this research we want to deliver a thorough understanding of how the different methods for handling missing data affect the accuracy of data mining algorithms when the uncertainty increases, i.e. the amount of missing data increases. We create an experimental environment using the university of California Irvine (UCI) Machine learning repository [13], by removing data from a number of UCI datasets completely at random (MCAR). We select increasing number of attributes at random to remove data from and we also increase the number of records at random from which we remove data in the attributes selected. Therefore, we produce a number of experimental datasets which contain increasing amounts of data MCAR.
Researchers have used a number of different methods to treat missing data in the data preprocessing phase. In this paper, we study the performance of classification algorithms in the context of increasing missing data under different pre-processing scenarios. In particular, we investigate how increasing the amount of missing data affects the performance for complete case analysis, and single imputation for a number of classification algorithms. We also compare that to the performance of algorithms with an internal mechanisms to handle the missing data, such as C4.5, and Random Forest.
The rest of this paper is organised as follows: Sect. 2 presents the problem of missing data and Sect. 3 presents the mechanisms used in Data Mining to address the problem. The methods used in our paper to set up our experimental environment are discussed in Sect. 4. Section 5 analyses the results. A discussion of the results is in Sect. 6. Finally, Sect. 7 presents our conclusions.
2 The Problem of Missing Data
Little and Rubin [14] have defined missing data based on the mechanism that generates the missing values into three main categories as follows: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing not at Random (MNAR). Missing Completely at Random (MCAR) occurs when the probability of an instance missing for a particular variable is independent from any other variable and independent from the missing data so missing is not related to any factor known or unknown in the study. Missing at Random (MAR) occurs when the probability of an instance having a missing value for an attribute may depend on the known values but not on the value of the missing data itself. Missing not at Random (MNAR) occurs when the probability of the instance having a missing value depends on unobserved values. This is also termed a nonignorable process and is the most difficult scenario to deal with. In this paper we focus on generating missing data using the MCAR mechanism. Further work will investigate the other mechanisms.
Horton et al. [9] have further categorized the patterns of missing data into monotone and non-monotone. They state that the patterns are concerned with which values are missing, whereas, the mechanisms are concerned with why data is missing. We can state that we have monotone patterns of missing data if the same data points have missing values in one or more features. We focus in this study on non-monotone missing data.
3 Dealing with Missing Data
In practice, there are three popular approaches that are commonly used to deal with incomplete data:
-
1.
Complete Case Analysis: This approach is the default in many statistical packages but should be only used when missing is under MCAR [14]. All incomplete data points are simply omitted from the dataset and only the complete records are used for model building [14]. The approach results in decreasing the size of data and the information available to the models and may also bias the results [20]. Tabachnick and Fidell [21] assumed that both the mechanisms and the patterns of missing values play a more significant role than the proportion of missing data when complete case analysis is used.
-
2.
Imputation: Imputation means that missing values are replaced in some way prior to the analysis [14]. Mean or median imputation is commonly used with numerical instances and mode imputation with the nominal instances. Such simple imputation methods have been criticized widely [4, 18], because they do not reflect the uncertainty in the data and may introduce bias in the analysis. On the other hand, multiple imputation [17], a more sophisticated method, replaces missing values with a number of plausible values which reflect the uncertainty although the technique may have higher computational complexity. A method for combining the results of the analysis on multiple datasets is also required. For regression analysis, Rubin [17] defined some rules to estimate parameters from multiple imputation analysis. For application to data mining, good methods for pooling the analysis may be required.
-
3.
Model Approach: A number of algorithms have been constructed to cope with missing data, that is, they can develop models in the presence of incomplete data. The internal mechanisms for dealing with missing data are discussed in the context of the algorithms used in this study.
3.1 The Classification Algorithms and Missing Data
We focus on the following well known classification algorithms, some of which have been identified as top data mining algorithms [25]: Decision Trees (C4.5), Naïve Bayes (NB), Random Forest (RF) and Support Vector Machines (SVMs). Further, we will explain how different algorithms and their implementations in Weka, our platform of choice, can treat missing values at both the building and the application phase.
C4.5 is one of the most influential decision trees algorithm. The algorithm was modified by Quinlan [15, 16] to treat missing data using fractional method in which the proportion of missing values of an attribute are used to modify the Information gain and Split ratio of the attribute’s Gain ratio. After making the decision for splitting on an attribute with the highest gain ratio, any instance with missing values of that attribute is split into several fractional instances which may travel down different branches of the tree. When classifying an instance with missing data, the instance is split into several fractional instances and the final classification decision is a combination of the fractional cases [6]. We use the Weka implementation, J48, which uses the fractional method [7].
Naïve Bayes algorithm is based on the Bayes theorem of probabilities using the simplification that the features are independent of one another. Naïve Bayes ignores features with missing values thus only the complete features are used for classification [2, 11]. Therefore, it uses complete case analysis instead of handling missing data internally.
Random Forest is an ensemble algorithm which produces multiple decision trees and can be used for classification and regression. It is considered as a robust algorithm and produces high classification accuracies. This is because random forest splits training samples to a number of subsets then builds a tree for each subset, rather than building one tree [1] and combines their decision. Random Forest, uses the fractional method [1, 10] for missing data in a similar manner to C4.5. The implementation of the algorithm in Weka also uses the fractional method as in C4.5 algorithm.
SVMs are used for binary classification and can be extended to higher dimensional datasets using the Kernel function [19]. SVMs maximize the margin between the separating hyperplane and the classes. The decision function is determined by a subset of training samples which are the support vectors. We use a Weka implementation called SMO (Sequential Minimal Optimization), a modification of the algorithm that solves the problem of Quadratic Programming (QP) when training SVMs in higher dimensions without extra storage or optimization calculations. Although SVMs do not deal with missing values [12], the SMO implementation performs simple imputation by globally replacing the missing values with the mode if the attribute is nominal or with the mean if the attribute is continuous [7].
4 Methods
For our study, a collection of 17 benchmark datasets are collected from UCI machine learning repository [13]. The datasets have different sizes and feature types (numerical continuous, numerical integer, categorical and mixed) as shown in Table 1. None of the datasets have missing values in their original form so this enables us to study how missing data affects the accuracy and performance of classification algorithms.
Data values are then removed completely at random as follows to generate increasing amounts of missing data. First, 10% (then 20%, 50%) of the attributes are randomly selected then missing values are artificially generated by removing values randomly in 5%, 30% and 50% of the records, respectively. As a result, nine artificial datasets are produced for each of the original datasets with multiple levels of missing data. In total, we have 153 datasets. Table 2 summarises the experimental scenarios artificially created.
For testing the models, 10-fold cross-validation was used and performed 10 times. All results reported represent the average of the 10 experiments with 10-fold cross-validation.
In the complete case analysis, all the incomplete records are omitted. This often results in datasets that are too sparse to be used for classification. The datasets that are left with enough records for classification are considered feasible.
To test simple imputation, the numerical attributes are replaced with their mean and the categorical attributes with their mode. Then the produced datasets after imputation are used for classification model building.
We use the classifiers: J48, Naïve Bayes, RandomForest and SMO implemented in Weka with their default options for classifying the data. We use the classification accuracy as a metric for our experiments. To further compare performance of the classifiers, we compute the average of the percentage difference in accuracy between a classifier obtained with the original (complete) datasets and the datasets with increasing missing data as follows:
where \(Acc\_Sce_{i}\) represents the classifier accuracy for a specific scenario, in our experiment we have 9 scenarios, and \(Acc\_Org_{j}\) represents the classifier accuracy of the corresponding original dataset.
We perform two different statistical tests when evaluating the performance of classifiers over the datasets as follows:
-
1.
When comparing differences in accuracy for each scenario we first use Wilcoxon Signed Rank test with a significance level at \(\alpha =0.05\).
-
2.
We then compare multiple classifiers over multiple datasets using the method described by Demšar [3], including the Friedman test and the post hoc Nemenyi test which is presented as a Critical Difference diagram, with a significance level of \(\alpha =0.05\).
5 Results
Figure 1 shows the average accuracy of classifiers and standard deviation (as error bars) for each of the original complete datasets along with the baseline majority class model accuracy. Models perform better than the baseline in most of the datasets except Post-Operative Patient, Breast Tissue, Spect, and LSVT Voice Rehabilitation, where default accuracy is similar or slightly better than that obtained by the models. We use the Friedman test for statistical differences. The resulting p-value \({<}\)0.05, so we proceed with Nemenyi test. The Critical Difference diagram for the Nemenyi test is shown in Fig. 2. The Figure illustrates that SMO and RandomForest behave better than J48 and Naïve Bayes although there is no statistical differences within each group.
5.1 Complete Case Analysis
The datasets that are not feasible for classification after removing missing records are marked with whereas the feasible are marked with as shown in Table 3. Datasets are ordered by increasing number of attributes (dimensionality) and then number of records. Only two low dimensional datasets are feasible for classification in all scenarios: Ecoli and Tic-tac-toe. In contrast, datasets with increasing dimensionality are not feasible for classification when increasing the amount of missing data due to widespread sparsity. For example, Hill Valley, UrbanLandCover, Epileptic Seizure Recognition, Semeion, LSVT Voice Rehabilitation, HAR Using Smartphones and Isolet all become mostly infeasible.
Figure 3 illustrates the average accuracy of the classifiers and standard deviation for the datasets that are feasible for classification. In scenario 1, the average and the standard deviation are nearly equal to those on the original data. However, with a decreasing number of feasible datasets, the standard deviation increases and the classifiers’ performance deteriorate as we increase missing data.
Table 4 shows the average \( \%\text {Diff}\) in accuracy between classifiers obtained with the original (complete) data and the datasets with increasing missing data for the different data handling approaches and algorithms. For complete case analysis, the deterioration in accuracy reached more than 18% for J48, RF, and SMO in different scenarios. However, Naïve Bayes behaved better gaining 2% in some scenarios. We do not produce statistical analysis due to the small number of datasets that produce a feasible classification with complete analysis.
5.2 Simple Imputation
Table 4 also shows the average of all the percentage differences in accuracy (%Diff) between a classifier obtained with the original (complete) datasets and the imputed data for each scenario and each algorithm. %Diff increases when missing data increases in all classifiers, however simple imputation performs much better than complete case analysis. Accuracy decreased in a small range between [−0.54,−5.59] for J48 and by −6.94% for RandomForest in the worst case. For Naïve Bayes, the differences with the original data where smaller with a maximum deterioration of −2.67%. SMO sees deteriorations of up to 5.71% in the scenarios of most missing data. We applied the Wilcoxon Signed Rank test to check statistical significance over the differences. Significant values are marked with * and tend to be those for the higher scenarios, except for SMO where the differences are more often statistically significant. From this we can conclude that simple imputation may work well for low amounts of missing data, and is beneficial over complete case analysis, but performance deteriorates significantly when the amount of missing data increases.
We also applied the Friedman test described by Demšar [3] and found statistically significant differences over multiple datasets in all scenarios except scenario 9 so we proceeded with the Nemenyi Test. We perform the post test between the classifiers over the imputed datasets for each scenario separately. The resulting Critical Difference diagrams in most scenarios in Fig. 4 show that RandomForest and SMO outperform J48 and Naïve Bayes. Random Forest seems to outperform SMO as the amount of missing data increases but not significantly. There is no statistical difference between RandomForest, SMO, and J48 in most scenarios. Overall, RandomForest was the most accurate classifier when the uncertainty increases and Naïve Bayes was the worst.
5.3 Building Models with Missing Data
In Sect. 3.1 we discussed that some of this algorithm have their own ways of dealing with missing data. We therefore pass all the data including missing data to the algorithms without preprocessing. We again compare (%Diff) in accuracy between a classifier obtained with the original (complete) datasets and the models built with missing data and show results in Table 4 with statistically significant differences marked by *. %Diff increases when missing data increases in all classifiers. However, for J48 in most scenarios the deterioration is within a small range [−0.19%,−3.95%] and similarly for RandomForest [−0.41%,−5.85%]. Naïve Bayes only ignores the missing values when computing the probability and the differences ranged between [+0.22%,−1.42]. SMO uses (mean/mode) imputation so behaves similarly to the imputed data performance in Table 4. In scenarios 8 and 9, the accuracy of all classifiers are statistically different from the classifiers’ accuracy for the original datasets. Thus, the capabilities of classifiers dealing with missing data seem to deteriorate when the ratio of missing data increases.
As before we apply the Friedman test and Nemenyi post test. The resulting Critical Difference diagrams in most scenarios show that RandomForest and SMO outperform J48 and Naïve Bayes. However, there is no statistical difference between all classifiers in scenario 9 whereas no statistical significant between RandomForest, SMO and J48 and between Naïve Bayes and J48. SMO was the most accurate classifier in the first six scenarios, however, when increasing missing data RandomForest outperforms other classifiers and Bayes was the worst in all scenarios. Figure 5 represents the Critical Difference diagrams of all scenarios.
6 Discussion
With complete data, Naïve Bayes and J48 perform worse than SMO and Random Forest. Complete case analysis results in many datasets becoming infeasible for analysis due to sparsity of the data for the algorithms we tested, thus it is not recommended if missing values are spread among records in high dimensional data. Simple imputation works well for low amounts of missing data but not when the amount of missing data increases substantially (scenarios 8,9), as the performance of all classifiers becomes statistically significantly worse than classifying with complete data. RandomForest and SMO behave better than J48 and Naïve Bayes in all scenarios (including when complete data is available). The capability to cope with missing data for RandomForest by using fractional method when uncertainty increases seems to outperform the SMO handling of missing data using mean/mode but not significantly.
7 Conclusion
Accuracy deteriorates for most classifiers when increasing percentages of missing data are encountered. Complete case analysis is not recommended if missing values are spread among (Features/Records) in high dimensional data. Simple imputation may help when a dataset has low ratio of missing values but not with increasing uncertainty. When applying the algorithms without preprocessing, again the trend is for some deterioration in performance with increasing missing data with those differences becoming statistically significant for the higher scenarios. So overall, we expect models to become worse as the amount of missing data increases though different algorithms do not perform significantly differently under those scenarios. As future work, we will expand on our imputation to include multiple imputation that combines models generated from multiple imputed datasets with data ensemble techniques to improve the performance of data mining classification algorithms for data with missing values.
References
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
Chai, X., Deng, L., Yang, Q., Ling, C.X.: Test-cost sensitive naive bayes classification. In: 2004 Fourth IEEE International Conference on Data Mining, ICDM 2004, pp. 51–58. IEEE (2004)
Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7(Jan), 1–30 (2006)
Fichman, A., Cummings, J.N.: Multiple imputation for missing data: Making the most of what you know. Organ. Res. Meth. 6(3), 282–308 (2003)
García-Laencina, P.J., Sancho-Gómez, J.-L., Figueiras-Vidal, A.R.: Pattern classification with missing data: a review. Neural Comput. Appl. 19(2), 263–282 (2010)
Gavankar, S., Sawarkar, S.: Decision tree: Review of techniques for missing values at training, testing and compatibility. In: 2015 3rd International Conference on Artificial Intelligence, Modelling and Simulation (AIMS), pp. 122–126. IEEE (2015)
George-Nektarios, T.: Weka classifiers summary. Athens University of Economics and Bussiness Intracom-Telecom, Athens (2013)
Grzymala-Busse, J.W., Hu, M.: A comparison of several approaches to missing attribute values in data mining. In: Ziarko, W., Yao, Y. (eds.) RSCTC 2000. LNCS (LNAI), vol. 2005, pp. 378–385. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-45554-X_46
Horton, N., Kleinman, K.P.: Much ado about nothing: a comparison of missing data methods and software to fit incomplete data regression models. Am. Stat. 61, 79–90 (2007)
Khalilia, M., Chakraborty, S., Popescu, M.: Predicting disease risks from highly imbalanced data using random forest. BMC Med. Inf. Decis. Making 11(1), 51 (2011)
Kohavi, R., Becker, B., Sommerfield, D.: Improving simple bayes. In: Proceedings of the European Conference on Machine Learning. Citeseer (1997)
Kotsiantis, S.B., Zaharakis, I., Pintelas, P.: Supervised machine learning: a review of classification techniques. Emerg. Artif. Intell. Appl. Comput. Eng. 160, 3–24 (2007)
Lichman, M.: UCI machine learning repository (2013). http://archive.ics.uci.edu/ml
Little, R.J.A., Rubin, D.B.: Statistical Analysis With Missing Data. Wiley, Hoboken (2014)
Quinlan, J.R.: C4.5: Programs for Machine Learning. Elsevier, San Francisco (2014)
Quinlan, J.R., et al.: Bagging, boosting, and c4. 5. In: The Association for the Advancement of Artificial Intelligence (AAAI), vol. 1, pp. 725–730 (1996)
Donald, B.: Rubin. Multiple imputation after 18+ years. J. Am. Stat. Assoc. 91(434), 473–489 (1996)
Scheffer, J.: Dealing with missing data. Res. Lett. Inf. Math. Sci. 3(1), 153–160 (2002)
Schölkopf, B., Burges, C.J.C., Smola, A.J.: Advances in Kernel Methods: Support Vector Learning. MIT press, Cambridge (1999)
Soley-Bori, M.: Dealing with missing data: Key assumptions and methods for applied analysis. Boston University School of Public Health (2013)
Tabachnick, B.G., Fidell, L.S., Osterlind, S.J.: Using Multivariate Statistics. Allyn and Bacon, Boston (2001)
Tran, C.T., Zhang, M., Andreae, P., Xue, B., Bui, L.T.: Multiple imputation and ensemble learning for classification with incomplete data. In: Leu, G., Singh, H.K., Elsayed, S. (eds.) Intelligent and Evolutionary Systems. PALO, vol. 8, pp. 401–415. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-49049-6_29
van der Heijden, G.J.M.G., Donders, A.R.T., Stijnen, T., Moons, K.G.M.: Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: a clinical example. J. Clin. Epidemiol. 59(10), 1102–1109 (2006)
Witten, I.H., Frank, E., Hall, M.A., Pal, C.J.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, Massachusetts (2016)
Wu, X., Kumar, V., Quinlan, J.R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G.J., Angus, N., Liu, B., Philip, S.Y., et al.: Top 10 algorithms in data mining. Knowl. Inf. Syst. 14(1), 1–37 (2008)
Acknowledgements
This work was supported by the Economic and Social Research Council (grant number ES/L011859/1).
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Aleryani, A., Wang, W., De La Iglesia, B. (2018). Dealing with Missing Data and Uncertainty in the Context of Data Mining. In: de Cos Juez, F., et al. Hybrid Artificial Intelligent Systems. HAIS 2018. Lecture Notes in Computer Science(), vol 10870. Springer, Cham. https://doi.org/10.1007/978-3-319-92639-1_24
Download citation
DOI: https://doi.org/10.1007/978-3-319-92639-1_24
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-92638-4
Online ISBN: 978-3-319-92639-1
eBook Packages: Computer ScienceComputer Science (R0)