Optimizing Large-Scale Semi-Naïve Datalog Evaluation in Hadoop

Shaw, Marianne; Koutris, Paraschos; Howe, Bill; Suciu, Dan

doi:10.1007/978-3-642-32925-8_17

Marianne Shaw¹⁸,
Paraschos Koutris¹⁸,
Bill Howe¹⁸ &
…
Dan Suciu¹⁸

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 7494))

Included in the following conference series:

International Datalog 2.0 Workshop

904 Accesses
25 Citations
3 Altmetric

Abstract

We explore the design and implementation of a scalable Datalog system using Hadoop as the underlying runtime system. Observing that several successful projects provide a relational algebra-based programming interface to Hadoop, we argue that a natural extension is to add recursion to support scalable social network analysis, internet traffic analysis, and general graph query. We implement semi-naive evaluation in Hadoop, then apply a series of optimizations spanning fundamental changes to the Hadoop infrastructure to basic configuration guidelines that collectively offer a 10x improvement in our experiments. This work lays the foundation for a more comprehensive cost-based algebraic optimization framework for parallel recursive Datalog queries.

Access provided by Autonomous University of Puebla. Download to read the full chapter text

Chapter PDF

Distributed Graph Analytics with Datalog Queries in Flink

RDF-SQ: Mixing Parallel and Sequential Computation for Top-Down OWL RL Inference

PathQuery Pregel: high-performance graph query with bulk synchronous processing

Article 01 August 2019

Keywords

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D., Silberschatz, A., Rasin, A.: Hadoopdb: an architectural hybrid of mapreduce and dbms technologies for analytical workloads. Proc. VLDB Endow. 2(1), 922–933 (2009)
Google Scholar
Afrati, F.N., Borkar, V., Carey, M., Polyzotis, N., Ullman, J.D.: Map-reduce extensions and recursive queries. In: Proceedings of the 14th International Conference on Extending Database Technology, EDBT/ICDT 2011, pp. 1–8. ACM, New York (2011)
Chapter Google Scholar
Alvaro, P., Condie, T., Conway, N., Elmeleegy, K., Hellerstein, J.M., Sears, R.: Boom analytics: exploring data-centric, declarative programming for the cloud. In: Morin, C., Muller, G. (eds.) EuroSys, pp. 223–236. ACM (2010)
Google Scholar
Balduccini, M., Pontelli, E., Elkhatib, O., Le, H.: Issues in parallel execution of non-monotonic reasoning systems. Parallel Comput. 31(6), 608–647 (2005)
Article Google Scholar
Barga, R., Ekanayake, J., Jackson, J., Lu, W.: Daytona: Iterative mapreduce on windows azure, http://research.microsoft.com/en-us/projects/daytona/
Borkar, V.R., Carey, M.J., Grover, R., Onose, N., Vernica, R.: Hyracks: A flexible and extensible foundation for data-intensive computing. In: Abiteboul, S., Böhm, K., Koch, C., Tan, K.-L. (eds.) ICDE, pp. 1151–1162. IEEE Computer Society (2011)
Google Scholar
Bu, Y., Howe, B., Balazinska, M., Ernst, M.: Haloop: Efficient iterative data processing on large clusters. In: Proc. of International Conf. on Very Large Databases, VLDB (2010)
Google Scholar
Dean, J., Ghemawat, S.: Mapreduce: Simplified data processing on large clusters. In: OSDI, pp. 137–150 (2004)
Google Scholar
Eisner, J., Filardo, N.W.: Dyna: Extending Datalog for Modern AI. In: de Moor, O., Gottlob, G., Furche, T., Sellers, A. (eds.) Datalog 2010. LNCS, vol. 6702, pp. 181–220. Springer, Heidelberg (2011)
Chapter Google Scholar
Ekanayake, J., Li, H., Zhang, B., Gunarathne, T., Bae, S.-H., Qiu, J., Fox, G.: Twister: a runtime for iterative mapreduce. In: HPDC, pp. 810–818 (2010)
Google Scholar
Hadoop, http://hadoop.apache.org/
Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: distributed data-parallel programs from sequential building blocks. In: Ferreira, P., Gross, T.R., Veiga, L. (eds.) EuroSys, pp. 59–72. ACM (2007)
Google Scholar
Joslyn, C., Adolf, R., Al Saffar, S., Feo, J., Goodman, E., Haglin, D., Mackey, G., Mizell, D.: High performance semantic factoring of giga-scale semantic graph databases. In: Semantic Web Challenge Billion Triple Challenge (2010)
Google Scholar
Loo, B.T., Gill, H., Liu, C., Mao, Y., Marczak, W.R., Sherr, M., Wang, A., Zhou, W.: Recent Advances in Declarative Networking. In: Russo, C., Zhou, N.-F. (eds.) PADL 2012. LNCS, vol. 7149, pp. 1–16. Springer, Heidelberg (2012)
Chapter Google Scholar
Malewicz, G., Austern, M.H., Bik, A.J.C., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: a system for large-scale graph processing. In: Elmagarmid, A.K., Agrawal, D. (eds.) SIGMOD Conference, pp. 135–146. ACM (2010)
Google Scholar
Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig latin: a not-so-foreign language for data processing. In: Wang, J.T.-L. (ed.) SIGMOD Conference, pp. 1099–1110. ACM (2008)
Google Scholar
Perri, S., Ricca, F., Sirianni, M.: A parallel asp instantiator based on dlv. In: Proceedings of the 5th ACM SIGPLAN Workshop on Declarative Aspects of Multicore Programming, DAMP 2010, pp. 73–82. ACM, New York (2010)
Chapter Google Scholar
Shaw, M., Detwiler, L., Noy, N., Brinkley, J.: S.D.: vsparql: A view definition language for the semantic web. Journal of Biomedical Informatics (2010), doi:10.1016/j.jbi, 08.008
Google Scholar
Zaharia, M., Chowdhury, N.M.M., Franklin, M., Shenker, S., Stoica, I.: Spark: Cluster computing with working sets. Technical Report UCB/EECS-2010-53, EECS Department, University of California, Berkeley (May 2010)
Google Scholar

Download references

Author information

Authors and Affiliations

University of Washington, USA
Marianne Shaw, Paraschos Koutris, Bill Howe & Dan Suciu

Authors

Marianne Shaw
View author publications
You can also search for this author in PubMed Google Scholar
Paraschos Koutris
View author publications
You can also search for this author in PubMed Google Scholar
Bill Howe
View author publications
You can also search for this author in PubMed Google Scholar
Dan Suciu
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Department of Computer Science, Universidad de Chile, Avenida Blanco Encalada 2120, 3er Piso, 5837-0459, Santiago, Chile
Pablo Barceló
Institute of Information Systems, Vienna University of Technology, Favoritenstraße 9-11, 1040, Wien, Austria
Reinhard Pichler

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Shaw, M., Koutris, P., Howe, B., Suciu, D. (2012). Optimizing Large-Scale Semi-Naïve Datalog Evaluation in Hadoop. In: Barceló, P., Pichler, R. (eds) Datalog in Academia and Industry. Datalog 2.0 2012. Lecture Notes in Computer Science, vol 7494. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32925-8_17

Download citation

DOI: https://doi.org/10.1007/978-3-642-32925-8_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-32924-1
Online ISBN: 978-3-642-32925-8
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Optimizing Large-Scale Semi-Naïve Datalog Evaluation in Hadoop

Abstract

Chapter PDF

Similar content being viewed by others

Distributed Graph Analytics with Datalog Queries in Flink

RDF-SQ: Mixing Parallel and Sequential Computation for Top-Down OWL RL Inference

PathQuery Pregel: high-performance graph query with bulk synchronous processing

Keywords

References

Author information

Authors and Affiliations

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Navigation

Optimizing Large-Scale Semi-Naïve Datalog Evaluation in Hadoop

Abstract

Chapter PDF

Similar content being viewed by others

Distributed Graph Analytics with Datalog Queries in Flink

RDF-SQ: Mixing Parallel and Sequential Computation for Top-Down OWL RL Inference

PathQuery Pregel: high-performance graph query with bulk synchronous processing

Keywords

References

Author information

Authors and Affiliations

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation