Abstract
Before consuming a dataset for any application, we need to understand the dataset at hand and its metadata. The process of discovering this metadata is known as data profiling. Data profiling focuses on examining datasets and collecting metadata, such as statistics or informative summaries, about the data. In this chapter, we discuss the importance of data profiling and shed light on data profiling in the big data setting. In addition, we detail data profiling use cases and review state-of-the-art data profiling systems and techniques. Finally, we conclude with directions and challenges for future research in data profiling.
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Elbaghazaoui, B.E., Amnai, M., Semmouri, A. (2021). Data Profiling over Big Data Area. In: Gherabi, N., Kacprzyk, J. (eds) Intelligent Systems in Big Data, Semantic Web and Machine Learning. Advances in Intelligent Systems and Computing, vol 1344. Springer, Cham. https://doi.org/10.1007/978-3-030-72588-4_8
DOI: https://doi.org/10.1007/978-3-030-72588-4_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-72587-7
Online ISBN: 978-3-030-72588-4
eBook Packages: Intelligent Technologies and Robotics (R0)