SgIndex: An Index Structure Supporting Multiple Graph Queries

Zhu, Shibiao; Huang, Yuzhou; Zhang, Zirui; Qin, Xiaolin

doi:10.1007/978-3-031-25158-0_45

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13421)

Included in the following conference series:

Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data

Abstract

With the rise of social networks, traffic navigation, and other fields, graph applications have become increasingly widespread. To improve query efficiency, indexes are built to manage large-scale graph data. However, existing indexes consume a large amount of memory and each supports only a single graph operation. We propose a two-layered index structure, SgIndex, in which the first layer stores subgraph information and the second layer stores adjacency information, so that it supports multiple path operations as well as subgraph matching queries. We also propose a subgraph matching algorithm based on path join, which completes subgraph matching by searching SgIndex twice. The experimental results show that SgIndex achieves better performance on path queries and subgraph matching than existing index structures while reducing memory overhead.

1 Introduction

Large-scale graph data with complex internal structure and diverse query requirements has emerged in many fields, and a variety of data operations have appeared in its applications [2, 11, 15]. On the one hand, some algorithms are difficult to adapt to large-scale graphs. On the other hand, most existing index structures lack generality: in practical applications, multiple indexes must be built for the same graph to respond to different query requirements. Therefore, this paper proposes SgIndex, a subgraph-based index structure that meets various query requirements on large-scale graph data.

First, graph operations are expensive on large-scale graph data. Compared with traditional relational data or XML trees, graph data lacks structural constraints and is complicated to operate on. Second, the flexible and diverse query requirements for large-scale graph data mean that multiple indexes must be built to meet them. Furthermore, graph data operations usually require iterative loops, which our index also needs to support.

We make the following contributions in this paper: 1. We propose an index that supports large-scale datasets and reduces storage space through an optimized index design; 2. The index structure supports various queries on graph data, including two-point path queries, path queries with a limited path length, path queries with a limited number of hops, and subgraph queries.

2 Related Work

Graphs are already a widely used data structure, but querying large-scale graph data has always been a difficult problem [5]. A popular solution is to build an index that does not consume too much storage space yet still speeds up the query process under various circumstances. However, in practical applications, multiple kinds of queries often need to be satisfied for the same graph data [2, 14, 15] (Fig. 1).

Given a directed graph G and two vertices s and t in it, a reachability query asks whether G contains a path from s to t. Reference [13] proposes a set of tools to quantify and query network reachability, and uses decision graphs as data structures to represent reachability matrices. Based on two classical shortest path algorithms, the shortest path problem can be decomposed into a problem of linear complexity [1, 8], the target solution can be optimized in terms of distance or time [7, 9, 10], or one or more preprocessing steps can be added to speed up shortest path queries [4, 12].

A subgraph is one of the basic concepts of graph theory: it is a graph whose vertex set and edge set are subsets of the vertex set and edge set of a given graph, respectively. Reference [3] adopts a left-deep join ordering strategy, which models the enumeration process as a join problem. Subtree features are also used for indexing, and they take less time to index than more general subgraph features. BINDEX [6] is a two-layered index with excellent query efficiency. It consists of a Filter Layer and a Refine Layer.

3 Index Based on Subgraph

3.1 Index Structure and Establishment Method

SgIndex is a subgraph-based index structure with two layers. In the first layer, a hash maps each vertex of the graph to the head of its adjacency list. In the second layer, we store interior vertex information and boundary vertex information for each vertex, as shown in Fig. 2(a). Both types of vertex information include the vertex value, the path weight, and the number of path hops. For each vertex, its interior vertices are the vertices in its local subgraph, and the vertices directly adjacent to that subgraph are classified as boundary vertices. The scale of the subgraph is called the threshold.
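The two-layer layout described above can be pictured with a minimal Python sketch. The names Entry, VertexRecord, and SgIndexSketch are hypothetical and chosen only for illustration; the sketch also assumes that the threshold is measured in hops, and it is not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable


@dataclass(frozen=True)
class Entry:
    """What the index keeps about one vertex reachable from the owner vertex."""
    weight: float  # total path weight from the owner vertex
    hops: int      # number of edges on that path


@dataclass
class VertexRecord:
    """Second layer: the two entry sets stored for one owner vertex."""
    interior: Dict[Hashable, Entry] = field(default_factory=dict)
    boundary: Dict[Hashable, Entry] = field(default_factory=dict)


# First layer: a hash table from each vertex to its record (assumed layout).
SgIndexSketch = Dict[Hashable, VertexRecord]

# Hand-filled record for vertex "a" on the chain a -> b -> c with unit weights
# and an assumed threshold of 1 hop: b lies inside the local subgraph, and c is
# directly adjacent to that subgraph, so it is stored as a boundary vertex.
index: SgIndexSketch = {"a": VertexRecord()}
index["a"].interior["b"] = Entry(weight=1.0, hops=1)
index["a"].boundary["c"] = Entry(weight=2.0, hops=2)
```

Keeping the weight and hop count next to each stored vertex is what lets later queries filter on path length or hop limits without revisiting the original graph.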

The selection of the threshold is not an optimization problem. For the queries currently implemented, the larger the threshold, the faster the query. The only costs are the space occupied by the stored index and the time needed to build it. For local queries, such as contact-tracing queries during the COVID-19 epidemic, fast queries can be guaranteed as long as the threshold is set to the specified number of hops. In practical applications, however, the required query scale often cannot be predicted easily, so the threshold should be chosen based on the acceptable storage overhead.

Algorithm 1 gives the pseudocode of the new index algorithm. First we fill the edge information of the graph into the index (lines 1–7). Second we iterate over the interior vertices of each vertex until the complete subgraph information is stored (lines 8–20). We can build an index as shown in Fig. 2(b).
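As a rough illustration of the two phases just described, the following Python sketch first loads the edges and then expands each vertex's neighborhood level by level. It is a sketch under stated assumptions rather than Algorithm 1 itself: the function name build_index and the dictionary layout are hypothetical, the threshold is assumed to be a hop count, and the stored weight is assumed to be that of the lightest path among those with the fewest hops.

```python
def build_index(edges, threshold):
    """Return {vertex: {"interior": {...}, "boundary": {...}}}.

    Each inner dict maps a reachable vertex to a (weight, hops) pair.
    """
    # Phase 1: fill the edge information into an adjacency map.
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})

    # Phase 2: expand every vertex one hop level at a time.
    index = {}
    for start in adj:
        seen = {start}
        level = {start: (0.0, 0)}
        interior, boundary = {}, {}
        for hops in range(1, threshold + 2):
            nxt = {}
            for u, (weight_u, _) in level.items():
                for v, w in adj[u].items():
                    if v in seen:
                        continue  # already reached with fewer hops
                    cand = weight_u + w
                    if v not in nxt or cand < nxt[v][0]:
                        nxt[v] = (cand, hops)
            # Levels 1..threshold form the local subgraph; the next level
            # holds the vertices directly adjacent to it.
            (interior if hops <= threshold else boundary).update(nxt)
            seen.update(nxt)
            level = nxt
        index[start] = {"interior": interior, "boundary": boundary}
    return index


edges = [("a", "b", 1.0), ("b", "c", 2.0), ("c", "d", 1.0)]
index = build_index(edges, threshold=2)
print(index["a"])
# {'interior': {'b': (1.0, 1), 'c': (3.0, 2)}, 'boundary': {'d': (4.0, 3)}}
```

A larger threshold moves more vertices into each interior set, which matches the trade-off discussed above: faster lookups at the cost of more storage and a longer build.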

3.2 Path Query Algorithm

Algorithm 2 shows how to find a path between two vertices. First, we query the index to get the interior set and boundary set of the starting vertex (lines 1–2). If the interior set contains the end vertex end, the path is returned directly (lines 3–4). Otherwise, the algorithm searches the boundary set iteratively: each round uses a vertex in the boundary set as the new starting vertex for a new path query, until the path is found (lines 5–11). The function \(FindPath(I, v_1, v_2, path)\) queries the index I to determine whether \(v_2\) is in the interior set of \(v_1\). If it is, path is updated. If it is not, the next iteration is performed on the boundary set of \(v_1\).
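The lookup-then-jump idea can be sketched in Python as follows. This is an illustrative sketch, not Algorithm 2: the function name find_path, the dictionary layout, and the hand-written index for the chain a -> b -> c -> d -> e (unit weights, assumed threshold of 1 hop) are all assumptions. Two details are added that the text does not spell out: a visited set so the search terminates on cyclic graphs, and a check of the boundary set for the end vertex. The result is some path expressed as a chain of waypoints, not necessarily the shortest one.

```python
from collections import deque


def find_path(index, start, end):
    """Return (waypoints, weight, hops) for some path start -> end, or None."""
    if start == end:
        return [start], 0.0, 0
    queue = deque([(start, [start], 0.0, 0)])
    visited = {start}
    while queue:
        v, waypoints, weight, hops = queue.popleft()
        record = index[v]
        # Direct hit: the end vertex is already stored for this vertex.
        for entries in (record["interior"], record["boundary"]):
            if end in entries:
                w, h = entries[end]
                return waypoints + [end], weight + w, hops + h
        # Otherwise restart the query from each unvisited boundary vertex.
        for b, (w, h) in record["boundary"].items():
            if b not in visited:
                visited.add(b)
                queue.append((b, waypoints + [b], weight + w, hops + h))
    return None


# Hypothetical index: each entry maps a vertex to (weight, hops).
index = {
    "a": {"interior": {"b": (1.0, 1)}, "boundary": {"c": (2.0, 2)}},
    "b": {"interior": {"c": (1.0, 1)}, "boundary": {"d": (2.0, 2)}},
    "c": {"interior": {"d": (1.0, 1)}, "boundary": {"e": (2.0, 2)}},
    "d": {"interior": {"e": (1.0, 1)}, "boundary": {}},
    "e": {"interior": {}, "boundary": {}},
}

print(find_path(index, "a", "d"))  # (['a', 'c', 'd'], 3.0, 3)
print(find_path(index, "e", "a"))  # None
```

Because each jump skips a whole local subgraph, the number of index lookups grows with the number of subgraphs crossed rather than with the number of edges on the path.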

3.3 Subgraph Matching Algorithm

The subgraph matching algorithm based on this index can be divided into three steps: 1. Decompose the query graph into multiple paths (line 1); 2. Generate an n-tuple for each path by index (lines 2–5); 3. Compare n-tuples by index to get the resulting subgraphs (lines 6–18).

The function Searchpath(I, h, w) searches index I for paths with \(v_i.hop=h\) and \(v_i.weight=w\), and outputs a collection of paths. The function \(Join(path_1,path_2)\) combines two paths under the join condition into a subgraph, which is then added to the subgraph set \(G_s\). The function \(Union(G_s,Path)\) compares the subgraphs in the set \(G_s\) with the paths in the set Path, and unions the subgraphs and paths that meet the join conditions.
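A simplified Python sketch of matching by path join is shown below. It is an assumption-laden illustration, not the authors' algorithm: the query graph is decomposed by hand into paths, each n-tuple is reduced to a pair of endpoints, the join condition is assumed to be agreement on shared query vertices, and the names search_path, join, and match as well as the example index are hypothetical. Weights are compared with exact equality, which is adequate only for a toy example.

```python
def search_path(index, hops, weight):
    """All (source, target) pairs whose stored entry has this hop count and weight."""
    found = []
    for source, record in index.items():
        for entries in (record["interior"], record["boundary"]):
            for target, (w, h) in entries.items():
                if h == hops and w == weight:
                    found.append((source, target))
    return found


def join(bindings, query_path, candidates):
    """Extend partial matches with the candidates for one more query path.

    bindings maps query vertices to data vertices; a candidate is kept only
    if it agrees with every query vertex that is already bound.
    """
    qs, qt = query_path
    extended = []
    for binding in bindings:
        for s, t in candidates:
            if binding.get(qs, s) != s or binding.get(qt, t) != t:
                continue
            merged = {**binding, qs: s, qt: t}
            # Distinct query vertices must map to distinct data vertices.
            if len(set(merged.values())) == len(merged):
                extended.append(merged)
    return extended


def match(index, query_paths):
    """query_paths: list of (query_source, query_target, hops, weight)."""
    bindings = [{}]
    for qs, qt, hops, weight in query_paths:
        bindings = join(bindings, (qs, qt), search_path(index, hops, weight))
    return bindings


# Hypothetical index for the data graph a->b (1), b->c (1), a->d (3), threshold 2.
index = {
    "a": {"interior": {"b": (1.0, 1), "d": (3.0, 1), "c": (2.0, 2)}, "boundary": {}},
    "b": {"interior": {"c": (1.0, 1)}, "boundary": {}},
    "c": {"interior": {}, "boundary": {}},
    "d": {"interior": {}, "boundary": {}},
}

# Query: x reaches y in 2 hops with weight 2, and x reaches z in 1 hop with weight 3.
print(match(index, [("x", "y", 2, 2.0), ("x", "z", 1, 3.0)]))
# [{'x': 'a', 'y': 'c', 'z': 'd'}]
```

The index is consulted once to produce candidates for each path and the join then prunes candidates that disagree on shared vertices, which mirrors the search and pruning stages described above.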

4 Experiments

4.1 Experimental Environment and Dataset

The machine used in this experiment has an Intel(R) Core(TM) i5-10300H CPU @ 2.50 GHz, 16.0 GB of memory, and a 64-bit operating system. The code was compiled with Visual Studio 2019.

All datasets are from KONECT: wikipedia_link_mi (Wmi), wikipedia_link_lez (Wlez), wikipedia_link_sah (Wsah), wikipedia_link_cy (Wcy), and wikipedia_link_bn (Wbn). The scale of these datasets is shown in Table 1 (Fig. 4).

Table 1. Datasets

4.2 Path Query

Figure 3(a) illustrates that the shortest path query time increases with the size of the dataset. For the smallest dataset, Wmi, the query time on SgIndex is 50% of that on G*Tree, while for the largest dataset, Wbn, it is 35% of that on G*Tree. SgIndex performs better on larger datasets because of its two-layered structure: it summarizes and stores subgraph information so that the shortest path can be obtained without traversing the entire graph. Figure 3(b) shows that SgIndex achieves better results on longer paths. Figure 3(c) shows our test of an extreme case: when the required path does not exist, we need to traverse at least all the successor vertices of the starting vertex. Even in this case, SgIndex achieves an advantage of more than 50%. Figure 3(d) compares our index structure with a mature graph data management system.

4.3 Subgraph Matching

We select SPATH, a classic index in the field of subgraph matching, and the above G*Tree for comparison, and conduct experiments on the above real graph data. Because SPATH's memory consumption on large graphs exceeds the limit of our experimental environment, we only conduct experiments on three datasets: Wmi, Wlez, and Wsah. We select query subgraphs with 5, 6, 8, and 10 vertices. As can be seen from Fig. 5, SgIndex is more efficient than SPATH for subgraph matching queries, especially as the subgraph size increases. This is because SPATH still adopts the traditional method of finding candidate vertices and pruning them one by one, whereas our method invokes the index in both the candidate path search stage and the pruning stage, thereby improving overall query efficiency. Figure 6 shows how our index compares to different systems.

5 Conclusion

From the perspective of practical application, we propose an index that supports large-scale datasets and reduces storage space through its design. At the same time, the proposed structure is a general index that supports a variety of queries on graph data. Experiments show that our index has advantages in both storage space and query efficiency.

References

1. Arz, J., Luxen, D., Sanders, P.: Transit node routing reconsidered. In: Bonifaci, V., Demetrescu, C., Marchetti-Spaccamela, A. (eds.) SEA 2013. LNCS, vol. 7933, pp. 55–66. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-38527-8_7
2. Cui, W., Xiao, Y., Wang, H., Wang, W.: Local search of communities in large graphs. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pp. 991–1002 (2014)
3. He, H., Singh, A.K.: Graphs-at-a-time: query language and access methods for graph databases. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp. 405–418 (2008)
4. Klein, P.N., Mozes, S., Weimann, O.: Shortest paths in directed planar graphs with negative lengths: a linear-space O(n log² n)-time algorithm. ACM Trans. Algorith. (TALG) 6(2), 1–18 (2010)
5. Li, L., Zhang, F., Zhang, Z., Li, P., Bu, C.: Multi-fuzzy-objective graph pattern matching in big graph environments with reliability, trust and social relationship. World Wide Web 23(1), 649–669 (2020)
6. Li, L., et al.: Bindex: a two-layered index for fast and robust scans. In: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pp. 909–923 (2020)
7. Möhring, R.H., Schilling, H., Schütz, B., Wagner, D., Willhalm, T.: Partitioning graphs to speedup Dijkstra’s algorithm. J. Exp. Algorith. (JEA) 11, 2–8 (2007)
8. Nannicini, G., Baptiste, P., Barbier, G., Krob, D., Liberti, L.: Fast paths in large-scale dynamic road networks. Comput. Optim. Appl. 45(1), 143–158 (2010)
9. Potamias, M., Bonchi, F., Castillo, C., Gionis, A.: Fast shortest path distance estimation in large networks. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management, pp. 867–876 (2009)
10. Schulz, F., Wagner, D., Weihe, K.: Dijkstra’s algorithm on-line: an empirical case study from public railroad transport. J. Exp. Algorith. (JEA) 5, 12-es (2000)
11. Seo, J., Guo, S., Lam, M.S.: Socialite: an efficient graph query language based on datalog. IEEE Trans. Knowl. Data Eng. 27(7), 1824–1837 (2015)
12. Sommer, C.: Shortest-path queries in static networks. ACM Comput. Surv. (CSUR) 46(4), 1–31 (2014)
13. Tesfaye, B., Augsten, N., Pawlik, M., Böhlen, M.H., Jensen, C.S.: An efficient index for reachability queries in public transport networks. In: Darmont, J., Novikov, B., Wrembel, R. (eds.) ADBIS 2020. LNCS, vol. 12245, pp. 34–48. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-54832-2_5
14. Yuan, L., Qin, L., Zhang, W., Chang, L., Yang, J.: Index-based densest clique percolation community search in networks. IEEE Trans. Knowl. Data Eng. 30(5), 922–935 (2017)
15. Zhuge, H., Liu, J., Feng, L., Sun, X., He, C.: Query routing in a peer-to-peer semantic link network. Comput. Intell. 21(2), 197–216 (2005)

Acknowledgement

This work was supported by the National Natural Science Foundation of China (61972198).

Author information

Authors and Affiliations

Nanjing University of Aeronautics and Astronautics, Nanjing, China
Shibiao Zhu, Yuzhou Huang, Zirui Zhang & Xiaolin Qin

Corresponding author

Correspondence to Xiaolin Qin.

Editor information

Editors and Affiliations

Nanjing University of Aeronautics and Astronautics, Nanjing, China
Bohan Li
Newcastle University, Callaghan, NSW, Australia
Lin Yue
Nanjing University of Aeronautics and Astronautics, Nanjing, China
Chuanqi Tao
Jinan University, Guangzhou, China
Xuming Han
Free University of Bozen-Bolzano, Bolzano, Italy
Diego Calvanese
University of Tsukuba, Tsukuba, Japan
Toshiyuki Amagasa

Cite this paper

Zhu, S., Huang, Y., Zhang, Z., Qin, X. (2023). SgIndex: An Index Structure Supporting Multiple Graph Queries. In: Li, B., Yue, L., Tao, C., Han, X., Calvanese, D., Amagasa, T. (eds) Web and Big Data. APWeb-WAIM 2022. Lecture Notes in Computer Science, vol 13421. Springer, Cham. https://doi.org/10.1007/978-3-031-25158-0_45

DOI: https://doi.org/10.1007/978-3-031-25158-0_45
Published: 10 February 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-25157-3
Online ISBN: 978-3-031-25158-0
eBook Packages: Computer Science, Computer Science (R0)
