Abstract
Data mining techniques are widely applied, and data warehousing plays an important role in that process. Scalability and efficiency have always been key issues in data warehousing. With the explosive growth of data, both are under increasing pressure, and traditional methods are reaching a bottleneck. In this paper, we present a document-based data warehousing approach in which the ETL process is carried out with the MapReduce framework and the data warehouse is built on a distributed, document-oriented database. A case study demonstrates the details of the entire process. Compared with RDBMS-based data warehousing, our approach shows better scalability, flexibility, and efficiency.
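The abstract does not give implementation details, so the following is only a minimal sketch of the general idea: a map step parses raw records, a grouping step collects values by key, and a reduce step folds each group into one nested, JSON-style document of the kind a document-oriented store would hold. The record layout, the field names, and the functions `map_record`, `shuffle`, `reduce_user`, and `run_etl` are all hypothetical and are not taken from the paper; the sketch runs in a single process and does not use Hadoop or MongoDB.

```python
import json
from collections import defaultdict

# Hypothetical raw log records (not from the paper): user, item, clicks.
RAW_RECORDS = [
    "u1,itemA,3",
    "u2,itemB,1",
    "u1,itemB,2",
    "u1,itemA,1",
]


def map_record(line):
    """Map step: parse one raw line and emit a (key, value) pair."""
    user, item, clicks = line.split(",")
    return user, (item, int(clicks))


def shuffle(pairs):
    """Group mapped values by key, the role a MapReduce framework plays between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_user(user, values):
    """Reduce step: fold all values for one key into a single nested document."""
    clicks_by_item = defaultdict(int)
    for item, clicks in values:
        clicks_by_item[item] += clicks
    return {
        "_id": user,  # assumed document key; chosen for illustration only
        "items": [
            {"item": name, "clicks": count}
            for name, count in sorted(clicks_by_item.items())
        ],
        "total_clicks": sum(clicks_by_item.values()),
    }


def run_etl(lines):
    """Run map, shuffle, and reduce in-process and return one document per user."""
    grouped = shuffle(map_record(line) for line in lines)
    return [reduce_user(user, values) for user, values in sorted(grouped.items())]


if __name__ == "__main__":
    for doc in run_etl(RAW_RECORDS):
        # Each printed line is a JSON document ready to load into a document store.
        print(json.dumps(doc))
```

Because each reduced document already nests the per-item detail under its user, a query for one user needs no join, which is one plausible reading of the flexibility the abstract claims over a relational layout.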
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chai, H., Wu, G., Zhao, Y. (2013). A Document-Based Data Warehousing Approach for Large Scale Data Mining. In: Zu, Q., Hu, B., Elçi, A. (eds) Pervasive Computing and the Networked World. ICPCA/SWS 2012. Lecture Notes in Computer Science, vol 7719. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37015-1_7
DOI: https://doi.org/10.1007/978-3-642-37015-1_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37014-4
Online ISBN: 978-3-642-37015-1
eBook Packages: Computer Science, Computer Science (R0)