Abstract
So far, many studies on join operations on MapReduce have already been proposed. Some of those studies focus on the low-selectivity joins that are frequently used in query processing for datasets among several management domains, such as those used with Linked Open Data. We found there is room for improvement of the state-of-the-art approach for the low-selectivity join on MapReduce called the per-split semi-join[5]. In this paper, we first thus extend the per-split semi-join to improve performance. Our approach can reduce three costs for low-selectivity joins: the amount of network traffic, the number of jobs, and the amount of disk I/O. Moreover, when the number of input relations is large, the selectivity becomes low and thus the effect of our proposed reductions is maximized. Therefore, we also propose an extension of the reduce-side join, which can easily apply three or more inputs, based on the extension of the per-split semi-join. In our experiments, we evaluated two comparisons: the per-split semi-join and our extension of it and the reduce-side join and our extension of it. The first experiment shows that our extension is better than its competitor in any case. In the second experiment, we found that our extension is superior to its competitor when the selectivity is low and the number of inputs is three or more.
Access provided by Autonomous University of Puebla. Download to read the full chapter text
Chapter PDF
Similar content being viewed by others
Keywords
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
References
Afrati, F.N., Ullman, J.D.: Optimizing joins in a map-reduce environment. In: Proceedings of the 13th International Conference on Extending Database Technology, EDBT 2010, pp. 99–110. ACM, New York (2010). http://doi.acm.org/10.1145/1739041.1739056
Andreas, C.: Designing a Parallel Query Engine over Map/Reduce. Master’s thesis, Informatics MSc, School of Informatics, University of Edinburgh (2010)
Atta, F.: Implementation and Analysis of Join Algorithms to handle skew for the Hadoop Map/Reduce Framework. Master’s thesis, Informatics MSc, School of Informatics, University of Edinburgh (2010)
Berners-Lee, T.: Linked data (August 27, 2006). http://www.w3.org/DesignIssues/LinkedData.html
Blanas, S., Patel, J.M., Ercegovac, V., Rao, J., Shekita, E.J., Tian, Y.: A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, pp. 975–986. ACM, New York (2010). http://doi.acm.org/10.1145/1807167.1807273
Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970). http://doi.acm.org/10.1145/362686.362692
Chandar, J.: Join Algorithms using Map/Reduce. Master’s thesis, Informatics MSc, School of Informatics, University of Edinburgh (2010)
Dean, J., Ghemawat, S.: Mapreduce: Simplified data processing on large clusters. In: Proceedings of the 6th Symposium on Operating System Design & Implementation, OSDI 2004, vol. 6, p. 10. USENIX Association, Berkeley (2004). http://dl.acm.org/citation.cfm?id=1251254.1251264
DeWitt, D.J., Gerber, R.H.: Multiprocessor hash-based join algorithms. In: Proceedings of the 11th International Conference on Very Large Data Bases, VLDB 1985, vol. 11, pp. 151–164. VLDB Endowment (1985). http://dl.acm.org/citation.cfm?id=1286760.1286774
Gates, A.: Programming Pig. O’Reilly, Media (September 2011)
Jadhav, V., Aghav, J., Dorwani, S.: Join algorithms using mapreduce: a survey. In: International Conference on Electrical Engineering and Computer Science, ICETACS 2013, Coimbatore, pp. 40–44, April 2013
Radu, V.: Application. In: Radu, V. (ed.) Stochastic Modeling of Thermal Fatigue Crack Growth. ACM, vol. 1, pp. 63–70. Springer, Heidelberg (2015)
Lee, K.H., Lee, Y.J., Choi, H., Chung, Y.D., Moon, B.: Parallel data processing with mapreduce: A survey. SIGMOD Rec. 40(4), 11–20 (2012). http://doi.acm.org/10.1145/2094114.2094118
Lee, T., Kim, K., Kim, H.J.: Join processing using bloom filter in mapreduce. In: Proceedings of the 2012 ACM Research in Applied Computation Symposium, RACS 2012, pp. 100–105. ACM, New York (2012). http://doi.acm.org/10.1145/2401603.2401626
Luo, G., Dong, L.: Adaptive join plan generation in hadoop. Tech. rep., Duke University (2010), cPS296.1 Course Project
Pigul, A.: Generalized parallel join algorithms and designing cost models. In: Proceedings of the Spring Young Researcher’s Colloquium On Database and Information Systems, SYRCoDIS 2012, pp. 29–40 (2012)
White, T.: Hadoop: The Definitive Guide. O’Reilly Media / Yahoo Press, 3rd edn. (May 2012)
Zhang, C., Wu, L., Li, J.: Efficient processing distributed joins with bloomfilter using mapreduce. International Journal of Grid and Distributed Computing 6(3), 43–58 (2013)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Matono, A., Ogawa, H., Kojima, I. (2015). Improvement of Join Algorithms for Low-Selectivity Joins on MapReduce. In: Sharaf, M., Cheema, M., Qi, J. (eds) Databases Theory and Applications. ADC 2015. Lecture Notes in Computer Science(), vol 9093. Springer, Cham. https://doi.org/10.1007/978-3-319-19548-3_10
Download citation
DOI: https://doi.org/10.1007/978-3-319-19548-3_10
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19547-6
Online ISBN: 978-3-319-19548-3
eBook Packages: Computer ScienceComputer Science (R0)