The ubiquity of large graphs and surprising challenges of graph processing: extended survey

Sahu, Siddhartha; Mhedhbi, Amine; Salihoglu, Semih; Lin, Jimmy; Özsu, M. Tamer

doi:10.1007/s00778-019-00548-x

Special Issue Paper
Published: 29 June 2019

Volume 29, pages 595–618, (2020)

The VLDB Journal

Siddhartha Sahu ORCID: orcid.org/0000-0003-1174-5115¹,
Amine Mhedhbi¹,
Semih Salihoglu¹,
Jimmy Lin¹ &
…
M. Tamer Özsu¹

Abstract

Graph processing is becoming increasingly prevalent across many application domains. In spite of this prevalence, there is little research about how graphs are actually used in practice. We performed an extensive study that consisted of an online survey of 89 users, a review of the mailing lists, source repositories, and white papers of a large suite of graph software products, and in-person interviews with 6 users and 2 developers of these products. Our online survey aimed at understanding: (i) the types of graphs users have; (ii) the graph computations users run; (iii) the types of graph software users use; and (iv) the major challenges users face when processing their graphs. We describe the participants’ responses to our questions highlighting common patterns and challenges. Based on our interviews and survey of the rest of our sources, we were able to answer some new questions that were raised by participants’ responses to our online survey and understand the specific applications that use graph data and software. Our study revealed surprising facts about graph processing in practice. In particular, real-world graphs represent a very diverse range of entities and are often very large, scalability and visualization are undeniably the most pressing challenges faced by participants, and data integration, recommendations, and fraud detection are very popular applications supported by existing graph software. We hope these findings can guide future research.

1 Introduction

Graph data representing connected entities and their relationships appear in many application domains, most naturally in social networks, the Web, the Semantic Web, road maps, communication networks, biology, and finance, to name a few examples. There has been a noticeable increase in work on graph processing both in research and in practice, evidenced by the surge in the number of commercial and research software products for managing and processing graphs. Examples include graph database systems [13, 20, 26, 49, 65, 73, 90], RDF engines [52, 96], linear algebra software [17, 63], visualization software [25, 29], query languages [41, 72, 78], and distributed graph processing systems [30, 34, 40]. In the academic literature, a large number of publications on numerous topics related to graph processing regularly appear across a wide spectrum of research venues.

Despite their prevalence, there is little research on how graph data are actually used in practice and the major challenges facing users of graph data, both in industry and in research. In April 2017, we conducted an online survey across 89 users of 22 different software products, with the goal of answering 4 high-level questions:

(i) What types of graph data do users have?
(ii) What computations do users run on their graphs?
(iii) Which software do users use to perform their computations?
(iv) What are the major challenges users face when processing their graph data?

Our major findings are as follows:

Variety: Graphs in practice represent a very wide variety of entities, many of which are not naturally thought of as vertices and edges. Most surprisingly, traditional enterprise data comprised of products, orders, and transactions, which are typically seen as the perfect fit for relational systems, appear to be a very common form of data represented in participants’ graphs.
Ubiquity of very large graphs: Many graphs in practice are very large, often containing over a billion edges. These large graphs represent a very wide range of entities and belong to organizations at all scales from very small enterprises to very large ones. This refutes the sometimes heard assumption that large graphs are a problem for only a few large organizations such as Google, Facebook, and Twitter.
Challenge of scalability: Scalability is unequivocally the most pressing challenge faced by participants. The ability to process very large graphs efficiently seems to be the biggest limitation of existing software.
Visualization: Visualization is a very popular and central task in participants’ graph processing pipelines. After scalability, participants indicated visualization as their second most pressing challenge, tied with challenges in graph query languages.
Prevalence of RDBMSes: Relational databases still play an important role in managing and processing graphs.

Our survey also highlights other interesting facts, such as the prevalence of machine learning on graph data, e.g., for clustering vertices, predicting links, and finding influential vertices.

We further reviewed user feedback in the mailing lists, bug reports, and feature requests in the source code repositories of 22 software products between January and September of 2017 with two goals: (i) to answer several new questions that the participants’ responses raised and (ii) to identify more specific challenges in different classes of graph technologies than the ones we could identify in participants’ responses. For some of the questions in our online survey, we also compared the graph data, computations, and software used by the participants with those studied in academic publications. For this, we reviewed 252 papers from 3 different years’ proceedings of 7 conferences across different academic venues.

Different database technologies and research topics are often motivated with a small set of common applications, informally referred to as “killer” applications of the technology. For example, object-oriented database systems are associated with computer-aided design and manufacturing, and XML is associated with the Web. An often-asked question in the context of graphs is: What is the killer application of graph software products? The wide variety of graphs and industry fields mentioned by our online survey participants hinted that we cannot pinpoint a small set of such applications. To better understand the applications supported by graphs, we reviewed the white papers posted on the Web sites of 8 graph software products. We also interviewed 6 users and 2 developers of graph processing systems. Our reviews and interviews corroborated our findings that graphs have a very wide range of applications but also highlighted several common applications, primarily in data integration, recommendations, and fraud detection, as well as several new applications we had not identified in our online survey. Our interviews also give more details than our online survey about the actual graphs used by enterprises and how they are used in applications.

In addition to discussing the insights we gained through our study, we discuss several directions about the future of graph processing. We hope our study can inform research about real use cases and important problems in graph processing.

2 Methodology of online survey, mailing lists, source repositories, and academic publications

In this section, we first describe the format of our survey and how we recruited the participants. Next, we describe the demographic information of the participants, including the organizations they come from and their roles in those organizations. We then describe our methodology for reviewing academic publications, followed by our methodology for reviewing the user feedback in the mailing lists, bug reports, and feature requests in the source code repositories of the software products. We end this section with a discussion of our methodology, which we believe other researchers can easily reproduce to study the uses of other technologies, and some lessons we learned from performing a user study. We describe our methodology for reviewing white papers and for conducting interviews in Sects. 4.1 and 5.1, respectively.

2.1 Online survey format and participants

2.1.1 Format

The survey was in the format of an online form. All of the questions were optional, and participants could skip any number of questions. There were 2 types of questions:

(i) Multiple-choice: There were 3 types of multiple-choice questions: (a) yes–no questions; (b) questions that allowed only a single choice as a response; and (c) questions that allowed multiple choices as a response. The participants could use an Other option when their answers required further explanation or did not match any of the provided choices. We randomized the order of choices in questions about the computations participants run and the challenges they face.
(ii) Short-answer: For these questions, the participants entered their responses in a text box.
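
The randomization of answer choices mentioned for multiple-choice questions can be illustrated with a short sketch. This is not the survey's actual implementation: the function name `randomized_choices`, the sample choices, and the decision to keep the Other option in the last position are assumptions made for illustration only.

```python
import random


def randomized_choices(choices, other_label="Other", seed=None):
    """Return a shuffled copy of `choices`, keeping the Other option last.

    Hypothetical helper: the survey does not state how the order was
    randomized or where the Other option was placed.
    """
    rng = random.Random(seed)  # independent generator; seed is optional
    shuffled = [c for c in choices if c != other_label]
    rng.shuffle(shuffled)      # in-place shuffle of the regular choices
    if other_label in choices:
        shuffled.append(other_label)
    return shuffled


# Illustrative choices for a question about challenges (assumed wording).
challenges = ["Scalability", "Visualization", "Query languages", "Other"]
print(randomized_choices(challenges))
```

Shuffling per participant spreads any position bias across the choices, so that no single option benefits from always being listed first.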

There were 34 questions grouped into six categories: (i) demographic questions; (ii) graph datasets; (iii) graph and machine learning computations; (iv) graph software; (v) major challenges; and (vi) workload breakdown.

2.1.2 Participant recruitment

We prepared a list of 22 popular software products for processing graphs (see Table 1) that had public user mailing lists covering 6 types of technologies: graph database systems, RDF engines, distributed graph processing systems (DGPSes), graph libraries to run and compose graph algorithms, visualization software, and graph query languages.^{Footnote 1} Our goal was to be as comprehensive as possible in recruiting participants from the users of different graph technologies. However, we acknowledge that this list is incomplete and does not cover all of the graph software used in practice.

Table 1 Software products used for recruiting participants and the count of active users in their mailing list in February–April 2017
