Abstract
Guaranteeing data quality requires cleaning the missing data that is prevalent in real-world databases. Many existing data cleaning methods derive the correct value for each missing data item by incorporating additional information, such as functional dependencies or integrity constraints. In this paper, we propose a method that cleans missing data items without such additional information, adopting the Bayesian network (BN) as the framework for representing and inferring probability distributions. First, we learn a Bayesian network, called IBN, from the complete part of the given incomplete database. Then, we infer the probability distribution of each missing data item by Gibbs sampling over the IBN. Consequently, we obtain all possible values of each missing item together with their probabilities (i.e., confidence degrees), which we use to clean the incomplete database. Experimental results show the efficiency, accuracy and precision of our method.
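The two steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy schema (binary attributes `A`, `B`, `C`), the fixed structure `A -> B`, `A -> C`, the Laplace smoothing constant `alpha`, and the function names `learn_cpts` and `gibbs_impute` are all assumptions made for illustration. In particular, the paper learns the network structure from data, whereas this sketch assumes the structure is given and only estimates the conditional probability tables from the complete tuples.

```python
import random
from collections import Counter, defaultdict

# Hypothetical toy schema: every attribute is binary (0/1).
# Assumed structure (given, not learned here): A -> B, A -> C.
PARENTS = {"A": [], "B": ["A"], "C": ["A"]}
VALUES = [0, 1]

def learn_cpts(rows, parents=PARENTS, alpha=1.0):
    """Estimate P(X | parents(X)) from the complete rows only (Laplace-smoothed)."""
    complete = [r for r in rows if None not in r.values()]
    counts = defaultdict(Counter)
    for r in complete:
        for var, pa in parents.items():
            counts[(var, tuple(r[p] for p in pa))][r[var]] += 1

    def prob(var, value, row):
        c = counts[(var, tuple(row[p] for p in parents[var]))]
        return (c[value] + alpha) / (sum(c.values()) + alpha * len(VALUES))

    return prob

def gibbs_impute(row, prob, parents=PARENTS, n_samples=2000, burn_in=200, seed=0):
    """Return, for each missing attribute of `row`, a sampled distribution over VALUES."""
    rng = random.Random(seed)
    missing = [v for v, x in row.items() if x is None]
    state = dict(row)
    for v in missing:
        state[v] = rng.choice(VALUES)  # arbitrary initial assignment
    tallies = {v: Counter() for v in missing}
    for step in range(burn_in + n_samples):
        for v in missing:
            weights = []
            for val in VALUES:
                state[v] = val
                # Unnormalised joint probability of the full assignment; factors
                # that do not involve v cancel when the weights are normalised.
                w = 1.0
                for var in parents:
                    w *= prob(var, state[var], state)
                weights.append(w)
            state[v] = rng.choices(VALUES, weights=weights)[0]
        if step >= burn_in:
            for v in missing:
                tallies[v][state[v]] += 1
    return {v: {val: tallies[v][val] / n_samples for val in VALUES} for v in missing}

# Toy incomplete table: 20 complete tuples plus one tuple with B missing.
rows = (
    [{"A": 1, "B": 1, "C": 1}] * 8
    + [{"A": 1, "B": 0, "C": 1}] * 2
    + [{"A": 0, "B": 0, "C": 0}] * 8
    + [{"A": 0, "B": 1, "C": 0}] * 2
    + [{"A": 1, "B": None, "C": 1}]
)
prob = learn_cpts(rows)
dist = gibbs_impute(rows[-1], prob)
print(dist)  # candidate values for B with their estimated probabilities
```

In this toy table the smoothed estimate of P(B=1 | A=1) is (8+1)/(10+2) = 0.75, so the sampled frequency for `B = 1` should land near that value. The returned dictionary plays the role of the "possible values with confidence degrees" described above; how those distributions are then turned into a cleaned database is left open here.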
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Duan, L., Yue, K., Qian, W., Liu, W. (2013). Cleaning Missing Data Based on the Bayesian Network. In: Gao, Y., et al. Web-Age Information Management. WAIM 2013. Lecture Notes in Computer Science, vol 7901. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39527-7_34
DOI: https://doi.org/10.1007/978-3-642-39527-7_34
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-39526-0
Online ISBN: 978-3-642-39527-7
eBook Packages: Computer Science, Computer Science (R0)