Abstract
The growing adoption of the MapReduce programming model makes it increasingly attractive to run MapReduce applications on Internet-wide computing platforms. However, the data distribution techniques that such platforms currently use to move the large volumes of data required by MapReduce jobs are naive, and therefore do not support running MapReduce over the Internet efficiently. We propose freeCycles, a computing platform that runs MapReduce jobs over the Internet and makes two main contributions: i) it improves data distribution, and ii) it increases the availability of intermediate data by replicating tasks or data across nodes, which avoids losing intermediate data and the significant delays in overall MapReduce execution time that such losses cause. We present the design and implementation of freeCycles, which uses the BitTorrent protocol to distribute all data, along with an extensive set of performance results that confirm the usefulness of these contributions. The system's improved data distribution and availability make it an ideal platform for large-scale MapReduce jobs.
Cite this article
Bruno, R., Costa, F. & Ferreira, P. freeCycles - Efficient Multi-Cloud Computing Platform. J Grid Computing 15, 501–526 (2017). https://doi.org/10.1007/s10723-017-9414-2
DOI: https://doi.org/10.1007/s10723-017-9414-2