1 Introduction

A graph is a flexible data structure that represents connections and relations among entities and concepts. Graphs are widely used in real-world applications, including XML documents, cyber-physical systems, social networks, biological networks and traffic networks [1,2,3, 9, 12]. The size of graphs such as knowledge graphs and social networks is growing rapidly, and such graphs may contain billions of vertices and edges. The k-hop reachability query in a directed graph was first discussed by Cheng et al. [1]. It asks whether a vertex u can reach a vertex v within k hops, i.e., whether there exists a directed path from u to v in the given directed graph whose length is at most k. Note that the input general directed graph is not necessarily connected. Take the graph G in Fig. 1(a) as an example: vertex a can reach vertex e within 2 hops, but a cannot reach vertex d within 1 hop.

Fig. 1. Illustration of input graph and existing works

Efficiently answering k-hop reachability queries is helpful in many analytical tasks on wireless networks, social networks and cyber-physical systems [1, 2, 12]. Several methods for k-hop reachability have been proposed, providing different techniques to solve this kind of query. However, existing methods suffer from shortcomings that make them not practical or general enough to answer k-hop reachability queries efficiently. To the best of our knowledge, k-reach [1, 2] is the only method that targets k-hop reachability queries on general directed graphs; it builds an index based on a vertex cover of the graph. It is infeasible to build such an index for large graphs due to the huge space cost. Thus a partial coverage is employed in [2]. However, the partial coverage technique is also not practical enough, since most queries may fall into the worst case, which requires an online BFS search.

Several methods have been proposed to solve k-hop reachability queries in DAGs. BFSI-B [12] builds a compound index containing both the FELINE index [10] and a breadth-first search index (BFSI). HT [3] works on a 2-hop cover index, which selects some high-degree nodes in the DAG as hop nodes. Experiments have shown that both of them are practical and efficient for answering k-hop reachability queries. However, they are developed only for DAGs and are therefore not general enough, since most graphs in real applications, such as social networks and knowledge graphs, may have cycles.

A simpler version of the k-hop reachability query is the reachability query. Given a graph G, a reachability query can be taken as a specific case of a k-hop reachability query, since the two are equivalent when \(k \ge \lambda (G)\), where \(\lambda (G)\) represents the length of the longest simple path in graph G. Note that for a general directed graph, we can obtain the corresponding DAG by condensing each strongly connected component (SCC) into a supernode, such that the reachability information in the original graph is completely preserved in the constructed DAG. Although many methods have been proposed to handle reachability queries [4, 6, 8, 10, 11, 13], they cannot be directly used for k-hop reachability queries, since information such as distance is lost in the transformation above.
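To make the loss of distance information concrete, the following is a minimal sketch of SCC condensation. It assumes Kosaraju's two-pass algorithm (the text does not say which SCC algorithm is used), and the names `condense`, `comp` and `dag` are hypothetical.

```python
def condense(adj):
    """Condense SCCs of a directed graph (adjacency lists over 0..n-1).

    Returns (comp, dag): comp[v] is the supernode id of v, and dag[c] is the
    set of supernodes that supernode c points to.
    """
    n = len(adj)
    # Pass 1: iterative DFS that records vertices in order of completion.
    seen = [False] * n
    order = []
    for s in range(n):
        if seen[s]:
            continue
        seen[s] = True
        stack = [(s, iter(adj[s]))]
        while stack:
            u, it = stack[-1]
            for w in it:
                if not seen[w]:
                    seen[w] = True
                    stack.append((w, iter(adj[w])))
                    break
            else:  # all successors of u are done
                order.append(u)
                stack.pop()
    # Pass 2: explore the reversed graph in decreasing completion order.
    radj = [[] for _ in range(n)]
    for u in range(n):
        for w in adj[u]:
            radj[w].append(u)
    comp = [-1] * n
    count = 0
    for s in reversed(order):
        if comp[s] != -1:
            continue
        comp[s] = count
        stack = [s]
        while stack:
            u = stack.pop()
            for w in radj[u]:
                if comp[w] == -1:
                    comp[w] = count
                    stack.append(w)
        count += 1
    # Keep only edges that connect different supernodes.
    dag = [set() for _ in range(count)]
    for u in range(n):
        for w in adj[u]:
            if comp[u] != comp[w]:
                dag[comp[u]].add(comp[w])
    return comp, dag


# Hypothetical graph: cycle 0 -> 1 -> 2 -> 0, plus edge 2 -> 3.
comp, dag = condense([[1], [2], [0, 3], []])
```

In this toy graph, vertices 0, 1 and 2 collapse into one supernode. The DAG still tells us that 0 reaches 3, but not that it takes 3 hops, which is exactly the information a k-hop query needs.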

We categorize the methods related to k-hop reachability queries [1,2,3,4, 6, 8, 10,11,12,13], as shown in Fig. 1(b). Clearly, the top-right corner represents k-hop reachability in general directed graphs, which is the most general setting. As discussed above, k-reach, the only existing method in this research area, is not practical enough to handle very large graphs. Hence, we develop a practical method named ESTI to answer k-hop reachability queries efficiently.

Our proposed approach, ESTI, follows the offline-and-online paradigm. It builds an index for a given graph in the offline phase, and answers arbitrary k-hop reachability queries in the online phase. In the offline indexing process, both the FELINE\(^+\) index and the Extended Spanning Tree Index (ESTI) are constructed. We introduce the concepts of Real Node and Virtual Node to build the extended spanning tree with both BFS and DFS. As for online querying, the offline index helps to answer k-hop reachability queries efficiently, and three pruning strategies are devised to further speed up the query process.

Paper Organization. The rest of this paper is organized as follows. Section 2 defines the problem and gives an overview of ESTI. Section 3 explains the details of the ESTI offline index, followed by the querying process in Sect. 4. Section 5 shows the results of experiments comparing ESTI with other k-hop reachability methods. In Sect. 6, some existing works related to k-hop reachability queries are presented. Finally, Sect. 7 concludes the paper.

Fig. 2. Overview of Extended Spanning Tree Index (ESTI)

2 Problem Definition and Overview

2.1 Problem Definition

In this paper, the input general directed unweighted graph is represented as \(G=(V,E)\), where V denotes the set of vertices and E denotes the set of edges. |V| and |E| denote the number of vertices and edges in G, respectively. For any two vertices \(u, v\in V\) and \(u\ne v\), we say that u can reach v within k hops if there exists a directed path from u to v in G which is not longer than k. Let \(u \xrightarrow []{?k} v\) represent a query asking whether u can reach v within k hops in G.
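As a reference point for the definition above, a query \(u \xrightarrow []{?k} v\) can always be answered without any index by a breadth-first search that stops after k levels. The sketch below is only that index-free baseline, not the ESTI method; the function name `k_hop_reachable` and the example graph are hypothetical.

```python
from collections import deque


def k_hop_reachable(adj, u, v, k):
    """Return True if there is a directed path from u to v of length <= k.

    adj maps each vertex to a list of its out-neighbours. The definition in
    the text assumes u != v; returning True for u == v is a convention
    chosen here for completeness.
    """
    if u == v:
        return True
    seen = {u}
    frontier = deque([(u, 0)])
    while frontier:
        x, d = frontier.popleft()
        if d == k:          # x is already k hops away: do not expand it
            continue
        for w in adj.get(x, ()):
            if w == v:
                return True
            if w not in seen:
                seen.add(w)
                frontier.append((w, d + 1))
    return False


# Hypothetical graph with a cycle a -> b -> c -> a and an edge c -> d.
g = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}
print(k_hop_reachable(g, "a", "d", 3))  # True: a -> b -> c -> d
print(k_hop_reachable(g, "a", "d", 2))  # False: d is 3 hops away
```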

2.2 Overview

ESTI follows the offline-and-online paradigm, and Fig. 2 presents the overview of our offline index structure. For better understanding, we briefly introduce our basic ideas and techniques for answering arbitrary k-hop reachability queries.

FELINE\(^+\) Index. Since reachability is a necessary condition for k-hop reachability, the FELINE index [10], which consists of two topological orders, can be utilized to efficiently filter unreachable queries. The time cost of generating the index in the offline phase is \(O(|V|log|V|+|E|)\). In Sect. 3.1, we present an optimization named FELINE\(^+\) to speed up index generation, which costs \(O(|V|log(Deg^{(out)}_m)+|E|)\) time, where \(Deg^{(out)}_m\) is the maximum outgoing degree of a vertex.

Extended Spanning Tree Index. In order to preserve as much information as possible for answering queries, we introduce the Virtual Root, Real Nodes and Virtual Nodes to construct an extended spanning tree from the input graph G in Sect. 3.2. In addition, pre- and postorders and a global level are assigned to the nodes in the tree, which helps to efficiently answer k-hop queries online.

Online Querying. Given an arbitrary query \(u \xrightarrow []{?k} v\), the constructed index is utilized to directly return the correct answer or to prune the search space. In Sect. 4.2, three pruning strategies are developed to further accelerate online querying.

Fig. 3. FELINE index (XY) in DAG \(G_A\)

3 Offline Indexing

3.1 FELINE\(^+\) Index

If u cannot reach v in G, the answer to query \(u \xrightarrow []{?k} v\) is clearly False. To efficiently filter such unreachable queries in the online querying phase, FELINE [10] condenses all strongly connected components (SCCs) in the given general directed graph G to obtain a DAG \(G_A\), and two topological orders X and Y are generated for each vertex in \(G_A\). Let \(X_v\) and \(Y_v\) denote the first and second topological order of a vertex v, respectively. If u can reach v, both \(X_u<X_v\) and \(Y_u<Y_v\) hold. Hence, for a query \(u \xrightarrow []{?k} v\), we can directly return the answer False if \(X_u>X_v\) or \(Y_u>Y_v\) in the FELINE index.

In FELINE [10], X is calculated by a topological ordering algorithm, and Y coordinate is assigned by applying a heuristic decision. When assigning Y coordinate, let R be a set storing all roots in current DAG. FELINE iteratively runs the following procedures until all vertices in \(G_A\) have Y coordinates.

Step 1. Choose the root r from R with largest \(X_r\), assign r a coordinate \(Y_r\);

Step 2. Remove all of r’s outgoing edges. Some of its children may then have no remaining predecessors and become new roots, so R is updated accordingly.

Example 1

By condensing all SCCs of graph G in Fig. 1(a), its corresponding DAG \(G_A\) is shown in Fig. 3. After assigning X, we start to assign Y and \(R=\{a, c, f'\}\). Since \(X_{f'}=3\) is the largest one, \(Y_{f'}\) is assigned to be 0, and next we assign \(Y_c=1\) and \(Y_a=2\). When all edges connecting with \(b'\) are removed, we update \(R=R \ \cup \{b'\}\) to continue assigning Y coordinate to \(b'\). As for online querying, for instance, vertex a cannot reach vertex c since \(Y_a>Y_c\) in Fig. 3.

The time cost of condensing SCCs and generating the X coordinate is \(O(|V|+|E|)\). Note that FELINE utilizes a max-heap to store all the current roots R, ordered by X. Locating the root r with the largest X in Step 1 takes O(1), while removing it from the max-heap takes O(log|V|); each vertex in \(G_A\) is inserted into R only once, which also costs O(log|V|) time. Hence, the overall index construction time of FELINE is \(O(|V|log|V|+|E|)\).
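The heap-based procedure and the filtering rule can be sketched as follows. This is an illustration under stated assumptions rather than the FELINE implementation: the X order is produced here by Kahn's algorithm (any topological order would do), Python's `heapq` min-heap is used with negated keys to emulate a max-heap, and the names `topological_x`, `feline_y` and `may_reach` are hypothetical.

```python
import heapq
from collections import deque


def indegrees(adj):
    indeg = [0] * len(adj)
    for u in range(len(adj)):
        for w in adj[u]:
            indeg[w] += 1
    return indeg


def topological_x(adj):
    """First topological order X of a DAG (Kahn's algorithm, an assumption)."""
    indeg = indegrees(adj)
    queue = deque(u for u in range(len(adj)) if indeg[u] == 0)
    X, rank = [0] * len(adj), 0
    while queue:
        u = queue.popleft()
        X[u] = rank
        rank += 1
        for w in adj[u]:
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    return X


def feline_y(adj, X):
    """Second order Y: repeatedly take the current root with the largest X."""
    indeg = indegrees(adj)
    heap = [(-X[u], u) for u in range(len(adj)) if indeg[u] == 0]
    heapq.heapify(heap)
    Y, rank = [0] * len(adj), 0
    while heap:
        _, r = heapq.heappop(heap)          # Step 1: root with largest X
        Y[r] = rank
        rank += 1
        for w in adj[r]:                    # Step 2: drop r's outgoing edges
            indeg[w] -= 1
            if indeg[w] == 0:               # w became a new root
                heapq.heappush(heap, (-X[w], w))
    return Y


def may_reach(X, Y, u, v):
    """False means u certainly cannot reach v; True means 'not filtered'."""
    return not (X[u] > X[v] or Y[u] > Y[v])


# Hypothetical DAG: 0 -> 2, 1 -> 2, 2 -> 3, 4 -> 5, 4 -> 6.
dag = [[2], [2], [3], [], [5, 6], [], []]
X = topological_x(dag)   # [0, 1, 3, 6, 2, 4, 5]
Y = feline_y(dag, X)     # [4, 3, 5, 6, 0, 2, 1]
print(may_reach(X, Y, 0, 4))  # False: X alone does not filter, but Y_0 > Y_4
```

Using two orders that disagree as much as possible is what gives the filter its power: in the toy DAG, vertex 0 precedes vertex 4 in X, so only Y rules the pair out.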

In this paper, we propose a novel technique to accelerate Y coordinate generation, which utilizes a simple array instead of a max-heap to store all the current roots R. First, R is initialized with all the roots of the original \(G_A\), sorted in descending order w.r.t. X value. Then the following two steps are processed iteratively until all the vertices have a Y coordinate.

Step 1. Pop the first element r from the array R and assign its Y coordinate.

Step 2. Remove all of r’s outgoing edges. Sort the resulting new roots in descending order of X value, then insert them at the front of array R, preserving that order.

Theorem 1

The order of elements in array R is always the same as the descending order of their X value.

Proof

At first, array R is initialized with all roots in original \(G_A\), which are sorted in the descending order w.r.t. X value. Assume that elements in array R are in the descending order of X value. When we pop the first element r from array R to assign \(Y_r\), \(X_r\ge X_v\) holds for any vertex v in array R. After removing r’s outgoing edges, some of its children w may become new roots and \(X_w>X_r\) must hold. Thus, every w has larger X than any v in array R. After sorting those new roots w in descending X value and inserting them in the front of array R, all the vertices in array R are still in their descending X order.    \(\square \)

The enhanced algorithm for accelerating FELINE, denoted by FELINE\(^+\), is shown in Algorithm 1. When generating the Y coordinate, according to Theorem 1, the first element r of array R always has the largest \(X_r\) value in R, so FELINE\(^+\) constructs exactly the same index as FELINE. Note that to make sure the initial roots in array R are in descending order w.r.t. X value, we only need to reverse the initial root queue of the X coordinate generation process, because their X values are generated following the order of that queue. Hence, the initialization time of array R is linear in the number of roots in the original \(G_A\). When processing each current root r, sorting the new roots takes O(|w|log|w|) time, where |w| is the number of new roots obtained by removing r’s outgoing edges. Since each vertex in \(G_A\) can become a new root only once, the time cost of generating the Y coordinate is \(O(|V|log(Deg^{(out)}_m)+|E|)\), where \(Deg^{(out)}_m\) is the maximum number of outgoing neighbors of a vertex and \(|w|\le Deg^{(out)}_m\) always holds.
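A minimal sketch of the array-based Y generation follows. It is an illustration and not a transcription of Algorithm 1: the array R is stored as a Python list used as a stack, so that the "front" of R is the end of the list and both popping and inserting are cheap. For brevity the initial roots are sorted explicitly instead of reversing the root queue, and the X values are passed in as input. The name `feline_plus_y` is hypothetical.

```python
def feline_plus_y(adj, X):
    """Y coordinates of a DAG, using a sorted array instead of a max-heap."""
    n = len(adj)
    indeg = [0] * n
    for u in range(n):
        for w in adj[u]:
            indeg[w] += 1
    # R holds the current roots in ascending X from bottom to top, i.e. the
    # top of the stack is the front of the descending array in the text.
    R = sorted((u for u in range(n) if indeg[u] == 0), key=lambda u: X[u])
    Y, rank = [0] * n, 0
    while R:
        r = R.pop()                       # Step 1: root with the largest X
        Y[r] = rank
        rank += 1
        new_roots = []
        for w in adj[r]:                  # Step 2: remove r's outgoing edges
            indeg[w] -= 1
            if indeg[w] == 0:
                new_roots.append(w)
        # By Theorem 1 every new root has a larger X than anything left in R,
        # so sorting only the new roots keeps the whole array ordered.
        new_roots.sort(key=lambda w: X[w])
        R.extend(new_roots)
    return Y


# Hypothetical DAG: 0 -> 2, 1 -> 2, 2 -> 3, 4 -> 5, 4 -> 6, with a
# topological order X assumed to be given.
dag = [[2], [2], [3], [], [5, 6], [], []]
X = [0, 1, 3, 6, 2, 4, 5]
print(feline_plus_y(dag, X))  # [4, 3, 5, 6, 0, 2, 1]
```

Each sort touches only the children of one vertex, which is where the \(log(Deg^{(out)}_m)\) factor in the bound comes from.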

The total time cost of building index for FELINE\(^+\) is \(O(|V|log(Deg^{(out)}_m)+|E|)\). Theoretically, since \(Deg^{(out)}_m\) is much smaller than |V| in many graphs, our approach is faster than the original FELINE whose time cost is \(O(|V|log|V|+|E|)\). Experiments confirm that the proposed optimization technique significantly accelerates the index construction for FELINE, as shown in Sect. 5.2.

3.2 Extended Spanning Tree Index for General Directed Graph

Preliminary. We first briefly introduce pre- and postorder index and global level for a tree, which have been used in GRIPP [9] and BFSI-B [12]. Note that BFSI-B applies min-post strategy, which actually has the same effect as pre- and postorders. For any vertex v in the tree, \(pre_v\) and \(post_v\) represent the pre- and postorder index of v, respectively. And \(level_v\) is the global level of v, i.e., the distance from the tree root to v. \(pre_v\) and \(post_v\) are generated during the DFS traversal, while \(level_v\) is generated during the BFS traversal.

Example 2

Figure 4(a) illustrates the three labels. Following the visiting order in DFS, we start from root a and set \(pre_a\) to 0. Then we visit b and c and set \(pre_b\) and \(pre_c\) to 1 and 2, respectively. After returning from c, we set \(post_c\) to 3. The process proceeds until all nodes have been visited. Each node is assigned both pre- and postorder index following the DFS. As for level index, \(level_a\) is set to be 0 and we can assign level to other vertices following the BFS.

We say that \((pre_v, post_v) \subset (pre_u, post_u)\) iff \(pre_v \ge pre_u \wedge post_v \le post_u\). Based on the constructed index \((pre_v, post_v, level_v)\) discussed above, Theorem 2 holds in the tree, and query \(u \xrightarrow []{?k} v\) can be efficiently answered. For example, in Fig. 4(a) a can reach d in 2 hops, since \((4, 5) \subset (0, 11)\) and \(level_d-level_a=2\).

Fig. 4. Illustration of \((pre_v, post_v, level_v)\) index and Virtual Root

Theorem 2

Given two vertices u and v in tree T, u can reach v within k hops if \((pre_v, post_v) \subset (pre_u, post_u) \wedge level_v-level_u \in (0, k]\).

Proof

According to the process of pre- and postorder generation, \((pre_v, post_v) \subset (pre_u, post_u)\) indicates that v is in the subtree whose root is u. \(level_v-level_u \in (0, k]\) implies that there is a path from u to v which is not longer than k.    \(\square \)
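The labeling and the test of Theorem 2 can be sketched in a few lines. The tree below is hypothetical (it is not claimed to be the tree of Fig. 4(a)), although it is chosen so that the labels quoted in the text, such as \((4, 5)\) for d and \((0, 11)\) for a, come out the same. The names `label_tree` and `tree_k_hop` are hypothetical, and the recursive DFS is only suitable for small trees.

```python
from collections import deque


def label_tree(children, root):
    """Assign pre/post (one shared DFS counter) and level (BFS depth)."""
    pre, post, level = {}, {}, {}
    counter = 0

    def dfs(u):
        nonlocal counter
        pre[u] = counter
        counter += 1
        for c in children.get(u, []):
            dfs(c)
        post[u] = counter
        counter += 1

    dfs(root)
    level[root] = 0
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for c in children.get(u, []):
            level[c] = level[u] + 1
            queue.append(c)
    return pre, post, level


def tree_k_hop(labels, u, v, k):
    """Theorem 2: interval containment plus a level difference in (0, k]."""
    pre, post, level = labels
    inside = pre[v] >= pre[u] and post[v] <= post[u]
    return inside and 0 < level[v] - level[u] <= k


# Hypothetical six-node tree.
tree = {"a": ["b", "e"], "b": ["c", "d"], "e": ["f"]}
labels = label_tree(tree, "a")
print(tree_k_hop(labels, "a", "d", 2))  # True: (4, 5) lies inside (0, 11)
print(tree_k_hop(labels, "a", "d", 1))  # False: d is 2 levels below a
```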

Clearly, if the input graph is a tree, both the time and space cost for building the index are \(O(|V|+|E|)\), and an online query takes only O(1). However, when the input general directed graph G is not a tree, to make the index practical and efficient enough for answering k-hop reachability queries, we introduce the Virtual Root, Real Nodes and Virtual Nodes to transform G into an Extended Spanning Tree (EST). Note that our method is quite different from existing approaches like GRIPP [9] and BFSI-B [12]. GRIPP solves reachability queries but ignores distance information, which is necessary for answering k-hop reachability queries, and BFSI-B is developed only for DAGs. However, most graphs in real life have cycles, and BFSI-B cannot directly work on these graphs.

Virtual Root. Since the given graph G may not be connected, e.g., the graph in Fig. 1(a), we add a virtual root \(V_R\) to make sure that it can reach all vertices in G. We first add an edge from \(V_R\) to all the vertices which have no predecessors, then explore from \(V_R\) to mark all of its descendants visited. The second step is to randomly select an unvisited vertex v, and add an edge from \(V_R\) to v while all of v’s descendants are marked visited. We repeat the second step until all vertices have been visited. Take graph G in Fig. 1(a) as an example. After adding a virtual root for it, we obtain a new graph \(G'\) in Fig. 4(b).
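The two steps for attaching the virtual root can be sketched as follows. The sketch returns the list of vertices that receive an edge from \(V_R\); the unvisited vertex in the second step is picked in index order rather than randomly, which is an assumption made here to keep the result deterministic. The name `add_virtual_root` is hypothetical.

```python
def add_virtual_root(adj):
    """Return the vertices that the virtual root V_R must point to."""
    n = len(adj)
    has_pred = [False] * n
    for u in range(n):
        for w in adj[u]:
            has_pred[w] = True
    visited = [False] * n

    def mark_descendants(s):
        visited[s] = True
        stack = [s]
        while stack:
            u = stack.pop()
            for w in adj[u]:
                if not visited[w]:
                    visited[w] = True
                    stack.append(w)

    root_edges = []
    # First step: connect V_R to every vertex without predecessors.
    for s in range(n):
        if not has_pred[s]:
            root_edges.append(s)
            mark_descendants(s)
    # Second step: cover what is left (e.g. cycles with no entry point).
    for s in range(n):
        if not visited[s]:
            root_edges.append(s)
            mark_descendants(s)
    return root_edges


# Hypothetical graph: 0 -> 1, a cycle 2 <-> 3, and 3 -> 1.
print(add_virtual_root([[1], [], [3], [2, 1]]))  # [0, 2]
```

Vertex 0 is the only vertex without predecessors; the cycle formed by 2 and 3 is unreachable from it, so one extra edge from \(V_R\) into the cycle is needed.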

Fig. 5. Extended spanning tree of G and \((pre_v, post_v, level_v)\) index

Real and Virtual Nodes. When starting BFS from the virtual root \(V_R\), we may run into an endless loop since there may exist cycles in \(G'\), or revisit vertices that have multiple incoming edges. To solve this problem, we introduce Real Nodes and Virtual Nodes. In the BFS process, if vertex v has never been visited, it is added to the spanning tree as a Real Node and we continue to visit its successors. If vertex v has been visited, it is added to the tree as a Virtual Node and its successors are not explored again. Following the above definition of Real Node and Virtual Node, we can construct an extended spanning tree from graph \(G'\), as shown in Example 3. Also, Theorem 3 holds.

Example 3

In Fig. 4(b), we start BFS from r and add real nodes for r, a, c, f, b, d and g. When exploring from b to visit d, we create a virtual node for d since it has been visited before. Figure 5 is the extended spanning tree of \(G'\).

Theorem 3

In extended spanning tree, each vertex v in graph \(G'\) must have exactly one real node. The total number of real and virtual nodes in this tree is equal to the number of edges in \(G'\) plus 1.

Proof

Since virtual root \(V_R\) can reach all vertices in \(G'\) and we start BFS from \(V_R\) to construct the extended spanning tree, a real node is created for each vertex v in \(G'\) when it is visited for the first time. When v is visited again, we only create a virtual node for it. Hence, each v in \(G'\) must have exactly one real node.

At the beginning of BFS, we create a real node for virtual root \(V_R\). As for the other vertices v in \(G'\), a real node or virtual node will be created for v only when we explore from its incoming neighbor. Hence, the number of real and virtual nodes in this tree is equal to the number of edges in \(G'\) plus one, where the additional one is the real node representing virtual root \(V_R\).    \(\square \)
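The construction and the counting argument of Theorem 3 can be checked on a small example. The sketch below is a minimal illustration with hypothetical names (`build_est`, the node dictionary fields) and a hypothetical graph in which "R" plays the role of the virtual root.

```python
from collections import deque


def build_est(adj, root):
    """BFS that creates one tree node per traversed edge, plus the root.

    The first node created for a vertex is its Real Node; every later one
    is a Virtual Node whose successors are not explored.
    """
    nodes = [{"vertex": root, "real": True, "parent": None, "level": 0}]
    real_of = {root: 0}               # vertex -> index of its real node
    queue = deque([0])
    while queue:
        i = queue.popleft()
        for w in adj[nodes[i]["vertex"]]:
            is_real = w not in real_of
            nodes.append({"vertex": w, "real": is_real, "parent": i,
                          "level": nodes[i]["level"] + 1})
            if is_real:
                real_of[w] = len(nodes) - 1
                queue.append(len(nodes) - 1)
    return nodes, real_of


# Hypothetical G' with virtual root "R" and a cycle a -> c -> a.
g = {"R": ["a"], "a": ["b", "c"], "b": ["c"], "c": ["a", "d"], "d": []}
nodes, real_of = build_est(g, "R")
num_edges = sum(len(ws) for ws in g.values())
print(len(nodes), num_edges + 1)                     # 7 7
print(sum(1 for nd in nodes if nd["real"]), len(g))  # 5 5
```

Here c and a each receive one virtual node, created when they are reached a second time through b -> c and c -> a respectively.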

Index Generation. Recall that in a tree, the index of vertex v consists of \(pre_v\), \(post_v\) and \(level_v\). When constructing the extended spanning tree from graph \(G'\), we have already run BFS in the tree, and level index will also be generated for all the nodes. Next, we explore the whole tree by DFS and assign each vertex with pre- and postorder index. Take the graph \(G'\) in Fig. 4(b) as an example. The index of its extended spanning tree is shown in Fig. 5. After assigning the above index, Theorem 4 holds for all the real and virtual nodes in the tree.

Theorem 4

If vertex v of \(G'\) has virtual nodes in the extended spanning tree, denote its unique real node as \(v'_r\). For any virtual node \(v'_i\) of v, \(level_{v'_i} \ge level_{v'_r}\).

Proof

When constructing the extended spanning tree by BFS, all the virtual nodes of v are created after its real node is created. Hence, based on the exploration order of BFS, \(level_{v'_i} \ge level_{v'_r}\).    \(\square \)

Let \(|V'|\) and \(|E'|\) denote the number of vertices and edges in \(G'\), respectively. When generating \(G'\) from original graph G, we add a virtual root \(V_R\) and at most |V| edges to connect vertices in G. Thus, \(O(|V'|+|E'|)=O(|V|+|E|)\).

The time and space complexity of adding a virtual root is \(O(|V|+|E|)\), since each vertex and edge is visited once. When constructing the extended spanning tree, each edge in \(G'\) is visited once, since we explore from vertex v only when its unique real node is created. According to Theorem 3, creating all real and virtual nodes costs \(O(|V|+|E|)\) time and space. Both BFS and DFS also take \(O(|V|+|E|)\) time and space. Hence, the overall time and space cost for constructing the extended spanning tree and the three labels is \(O(|V|+|E|)\), which indicates that it is feasible even for very large graphs.

3.3 Summary of Offline Indexing

The index of our proposed ESTI method consists of two parts: FELINE\(^+\) (Sect. 3.1) and the extended spanning tree (Sect. 3.2). The whole generation process is shown in Algorithm 2. Recall that building the FELINE\(^+\) index takes \(O(|V|log(Deg^{(out)}_m)+|E|)\) time and O(|V|) space, where \(Deg^{(out)}_m\) is the maximum outgoing degree in \(G_A\). The time and space cost of constructing the extended spanning tree and the three labels are both \(O(|V|+|E|)\). Hence, the overall index construction time of ESTI is \(O(|V|log(Deg^{(out)}_m)+|E|)\), and the index size is \(O(|V|+|E|)\). Next, we will show how the constructed index supports efficient online k-hop reachability queries.

4 Online Querying

4.1 Basic Query Process

After constructing ESTI index (Sect. 3) for the input graph G, we can utilize the index to answer k-hop reachability queries online. Given a query \(u \xrightarrow []{?k} v\), if \(u=v\) or \(k \le 0\) we can directly return the answer. Assume that \(u \ne v\) and \(k>0\), the basic query function is shown in Algorithm 3.

As discussed in Sect. 3.1, in Line 1–2, if the topological order X (or Y) of u’s corresponding vertex in DAG \(G_A\) is larger than v’s X (or Y), we can safely return False. In Line 3–6, the pre- and postorders of real and virtual nodes are compared. Note that in Line 7–15, we run DFS only when \(k>1\) (Line 7) because the exploration will never return True when \(k\le 1\). If \(k=1\), the answer from Line 3–6 is the final answer, and \(k=0\) is impossible since the initial input assumes that \(k>0\) while function Query is invoked recursively only when \(k>1\).

Example 4

Given the constructed index in Fig. 5, for query \(c \xrightarrow []{?3} b\), we invoke Query(c, b, 3). The pre- and postorder of c’s Real Node is (11, 16), but the real node of b has index \((2,9) \not \subset (11,16)\) and its virtual node has index \((6,7) \not \subset (11,16)\). Then Query(d, b, 2) is invoked, which results in calling Query(e, b, 1). Luckily, b’s virtual node has index \((6,7) \subset (5,8)\) and the function returns True.
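The following sketch reconstructs the basic query from the description above; it is not a transcription of Algorithm 3. It assumes that the recursive step explores the out-neighbors w of u with Query(w, v, k-1), as in Example 4, and it omits the FELINE\(^+\) filter of Line 1–2 to stay short. All names (`build_index`, `query`, the index fields) and the example graph are hypothetical.

```python
from collections import deque


def build_index(adj, root):
    """Extended spanning tree with (pre, post, level) for every tree node."""
    vertex, level, children = [root], [0], [[]]
    real_of, nodes_of = {root: 0}, {root: [0]}
    queue = deque([0])
    while queue:                                   # BFS: nodes and levels
        i = queue.popleft()
        for w in adj[vertex[i]]:
            j = len(vertex)
            vertex.append(w)
            level.append(level[i] + 1)
            children.append([])
            children[i].append(j)
            nodes_of.setdefault(w, []).append(j)   # real and virtual nodes
            if w not in real_of:                   # first visit: Real Node
                real_of[w] = j
                queue.append(j)
    pre, post = [0] * len(vertex), [0] * len(vertex)
    counter, stack = 0, [(0, False)]
    while stack:                                   # iterative DFS: pre/post
        i, done = stack.pop()
        if done:
            post[i] = counter
        else:
            pre[i] = counter
            stack.append((i, True))
            stack.extend((j, False) for j in reversed(children[i]))
        counter += 1
    return {"pre": pre, "post": post, "level": level,
            "real_of": real_of, "nodes_of": nodes_of}


def query(adj, idx, u, v, k):
    """Can u reach v within k hops? Assumes u != v and k > 0."""
    ur = idx["real_of"][u]
    for j in idx["nodes_of"][v]:                   # tree check
        inside = (idx["pre"][ur] <= idx["pre"][j]
                  and idx["post"][j] <= idx["post"][ur])
        if inside and idx["level"][j] - idx["level"][ur] <= k:
            return True
    if k > 1:                                      # explore u's successors
        return any(w == v or query(adj, idx, w, v, k - 1) for w in adj[u])
    return False


# Hypothetical G' with virtual root "R" and a cycle a -> c -> a.
g = {"R": ["a"], "a": ["b", "c"], "b": ["c"], "c": ["a", "d"], "d": []}
idx = build_index(g, "R")
print(query(g, idx, "b", "d", 2))  # True: b -> c -> d
print(query(g, idx, "b", "d", 1))  # False
```

Without pruning, the recursion may revisit the same vertex many times, which is why additional pruning matters in practice.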

To further improve the performance of online querying, we develop three pruning strategies based on properties of the extended spanning tree.

4.2 Pruning Strategies

Prune I. For query \(u \xrightarrow []{?k} v\), denote by \(u'_r\) and \(v'_r\) the real nodes of u and v, respectively. The Prune I strategy utilizes Theorem 5 to stop redundant exploration in advance, i.e., Query(u, v, k) directly returns False if \(level_{v'_r} - level_{u'_r} > k\).

Theorem 5

If \(level_{v'_r} - level_{u'_r} > k\), u cannot reach v within k hops.

Proof

Note that as discussed above, we never invoke Query(u, v, k) with \(k=0\).

(Case 1). When \(k=1\), assume that \(level_{v'_r} - level_{u'_r} > 1\). If u can reach v within 1 hop, v has a real or virtual node \(v'\) which is the child of \(u'_r\) and \(level_{v'}=level_{u'_r}+1\). According to Theorem 4, \(level_{v'} \ge level_{v'_r}\) indicates that \(level_{v'_r} - level_{u'_r} \le level_{v'} - level_{u'_r} = 1\), which contradicts the assumption.

(Case 2). When \(k>1\), in function Query(u, v, k), Line 3–6 will never return True since \(level_{v'} \ge level_{v'_r}\) and \(level_{v'} - level_{u'_r} \ge level_{v'_r} - level_{u'_r} > k\). Hence we need to invoke \(Query (w, v, k-1)\) or \(Query (u, w, k-1)\). For \(Query (w, v, k-1)\), since the real or virtual node \(w'\) is a child of \(u'_r\) in the tree, the real node of w satisfies \(level_{w'_r} \le level_{w'} = level_{u'_r}+1\). Thus, we have \(level_{v'_r} - level_{w'_r} \ge level_{v'_r} - level_{u'_r} -1 > k-1\), and \(Query (w, v, k-1)\) falls into Case 1 or Case 2 again. For \(Query (u, w, k-1)\), since \(w'_r\) is the parent of one of the real or virtual nodes \(v'\) in the tree, \(w'_r\) satisfies \(level_{w'_r} = level_{v'}-1 \ge level_{v'_r}-1\). Thus, we have \(level_{w'_r} - level_{u'_r} \ge level_{v'_r} - level_{u'_r} -1 > k-1\), and \(Query (u, w, k-1)\) also falls into Case 1 or Case 2 again.

Hence, if \(level_{v'_r} - level_{u'_r} > k\), u cannot reach v within k hops.    \(\square \)

Example 5

In Fig. 5, for query \(f \xrightarrow []{?1} e\), both real and virtual nodes of e have level 3, while the real node of f has level 1. Since \(3-1>k=1\), we return False.
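In code, Prune I is a single comparison on the levels of the two real nodes. The sketch below uses the level values quoted in Example 5; the function name `prune_by_level` and the dictionary layout are hypothetical.

```python
def prune_by_level(real_level, u, v, k):
    """Theorem 5: True means u certainly cannot reach v within k hops.

    real_level maps each vertex to the level of its real node. A False
    result only means the query is not pruned; it does not imply that
    v is reachable.
    """
    return real_level[v] - real_level[u] > k


# Levels of the real nodes of f and e as quoted in Example 5.
real_level = {"f": 1, "e": 3}
print(prune_by_level(real_level, "f", "e", 1))  # True: 3 - 1 > 1
```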

Prune II. In Line 3–6 of Algorithm 3, we iterate over all real and virtual nodes \(v'\) to compare \((pre_{v'}, post_{v'})\) with \((pre_{u'_r}, post_{u'_r})\), where \(u'_r\) is the unique real node of u. By the generation process of the pre- and postorder index, the intervals \((pre_i, post_i)\) and \((pre_j, post_j)\) of any two nodes i and j never partially overlap: they are either nested or disjoint. Therefore, instead of comparing the whole interval \((pre_{v'}, post_{v'})\), we only need to check whether \(pre_{v'} \in (pre_{u'_r}, post_{u'_r})\). Hence, the \(post_{v'_i}\) index of the virtual nodes \(v'_i\) is never used in the online phase, which means that we do not need to store the post index of virtual nodes in the offline phase.

Moreover, when vertex v has many virtual nodes \(v'_i\), checking whether \(pre_{v'_i} \in (pre_{u'_r}, post_{u'_r})\) one node at a time is not efficient enough. If all the virtual nodes \(v'_i\) have been sorted w.r.t. their \(pre_{v'_i}\) in the offline phase, we can instead spend only \(O(log(|v'_i|))\) time to find the first virtual node with \(pre_{v'_i}>pre_{u'_r}\) and iterate from it until \(pre_{v'_i}>post_{u'_r}\), where \(|v'_i|\) is the number of virtual nodes representing v. Note that the number of virtual nodes representing vertex v is equal to its incoming degree in \(G'\) minus 1, since in the extended spanning tree construction (Sect. 3.2) we create a virtual node for v only when v is visited again from an incoming neighbor. Hence, sorting all virtual nodes \(v'_i\) w.r.t. \(pre_{v'_i}\) for each vertex v costs \(O(|E|log(Deg^{(in)}_m))\), where \(Deg^{(in)}_m\) is the maximum incoming degree of a vertex. The overall time cost of offline indexing is then \(O(|V|log(Deg^{(out)}_m)+|E|log(Deg^{(in)}_m))\) if the Prune II strategy is used in the online phase.
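The binary search over the sorted virtual nodes can be sketched with Python's `bisect` module. The sketch assumes that the virtual nodes of v are stored as two parallel lists sorted by pre value (`pres` and `levels`), and that a candidate inside the interval still has to pass the level test; the function name and the sample numbers are hypothetical.

```python
from bisect import bisect_right


def virtual_node_within(pres, levels, pre_u, post_u, level_u, k):
    """Is some virtual node of v inside u's subtree and at most k levels down?

    pres is sorted in ascending order; levels[i] belongs to pres[i].
    """
    i = bisect_right(pres, pre_u)        # first node with pre > pre_u
    while i < len(pres) and pres[i] < post_u:
        if levels[i] - level_u <= k:
            return True
        i += 1
    return False


# Hypothetical virtual nodes of some vertex v, sorted by pre value.
pres, levels = [6, 14, 30], [4, 3, 2]
print(virtual_node_within(pres, levels, 5, 8, 3, 1))    # True: pre 6 in (5, 8)
print(virtual_node_within(pres, levels, 11, 16, 1, 1))  # False: 2 levels down
```

Only the pre values are consulted, which is consistent with dropping the post index of virtual nodes.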

Prune III. For the real node \(u'_r\) of each vertex u, while performing the DFS traversal in offline index construction, we can compute \(dist_u\), the distance from \(u'_r\) to the nearest virtual node \(w'_i\) among all its successors in the extended spanning tree. Given the dist index of every real node in the tree, for query \(u \xrightarrow []{?k} v\), if \(dist_{u}\ge k\), we do not have to explore u’s successors. That is because, when exploring from \(u'_r\) in the tree, virtual nodes can only exist at the \(k^{th}\) hop or beyond. Assume that u can reach v within k hops. When one of v’s real or virtual nodes is in the subtree rooted at \(u'_r\), the query returns True in Line 3–6 of Algorithm 3. When none of v’s real and virtual nodes is in the subtree rooted at \(u'_r\), there must exist a virtual node \(w'_i\) through which the path leaves the subtree to reach v. Note that \(level_{w'_i}-level_{u'_r}<k\) must hold, since otherwise more than k hops are needed from u to v. However, this contradicts \(dist_{u}\ge k\), since the distance from \(u'_r\) to \(w'_i\) is smaller than k.

Example 6

In Fig. 5, for query \(f \xrightarrow []{?3} c\), the pre- and postorder index of c is not in the interval of f’s index, i.e., \((11,16) \not \subset (17,24)\). Next, instead of exploring g and h, we can safely return False directly since \(dist_{f}=k=3\).
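The dist index used by Prune III can be computed in one bottom-up pass over the extended spanning tree. The sketch below assumes the tree is given as child lists over node indices plus a flag marking virtual nodes; the names `nearest_virtual_distance` and `can_skip_exploration` and the sample tree are hypothetical.

```python
import math


def nearest_virtual_distance(children, is_virtual, root=0):
    """dist[i] = hops from tree node i down to its nearest virtual descendant.

    Nodes without any virtual descendant get math.inf.
    """
    dist = [math.inf] * len(children)
    order, stack = [], [root]
    while stack:                      # parents are recorded before children
        i = stack.pop()
        order.append(i)
        stack.extend(children[i])
    for i in reversed(order):         # so children are finished first here
        for j in children[i]:
            d = 1 if is_virtual[j] else 1 + dist[j]
            dist[i] = min(dist[i], d)
    return dist


def can_skip_exploration(dist_u, k):
    """Prune III: no need to explore u's successors once the tree check fails."""
    return dist_u >= k


# Hypothetical tree: 0 -> {1, 2}, 1 -> 3, 3 -> 4, where node 4 is virtual.
children = [[1, 2], [3], [], [4], []]
is_virtual = [False, False, False, False, True]
dist = nearest_virtual_distance(children, is_virtual)
print(dist[0], can_skip_exploration(dist[0], 3))  # 3 True
```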

4.3 Summary of Online Querying

After applying the three pruning strategies discussed in Sect. 4.2, the ESTI query function Query(u, v, k) is shown in Algorithm 4. Though in the worst case we still need to explore the whole graph, the ESTI index prunes a large part of the online search space. Section 5 will demonstrate its practical efficiency.

5 Experiments

We evaluate the effectiveness and efficiency of the proposed ESTI method by carrying out extensive experiments on both small and large graphs. All the experiments are conducted on a Linux machine with an Intel(R) Xeon(R) E5-2678 v3 CPU @2.5GHz and 220G RAM, and all algorithms are implemented in C++ and compiled by G++ 5.4.0 with -O3 optimization. Each experiment has been run 10 times, and the results are consistent across the 10 executions. In this section, we report the average value over the 10 executions of each experiment.

5.1 Datasets

A variety of real graphs are used in our experiments, as shown in Table 1. kegg, amaze, nasa, go, mtbrv, anthra, ecoo, agrocyc and human are small graphs from different sources [13]. p2p-Gnutella graphs are 8 snapshots of Gnutella peer to peer file network, while soc-Epinions1 is a who-trust-whom online social network [5]. As for large graphs, 10go-uniprot, go-uniprots, uniprotenc22m, uniprotenc100m and uniprotenc150m come from Uniprot database. 10cit-Patent, 05cit-Patent, cit-Patents and citeseer are citation networks [3]. WikiTalk is a Wikipedia communication network, while soc-Pokec and twitter are large-scale social networks [5, 7]. govwild and yago are RDF datasets [7].

Table 1. Statistics of datasets
Fig. 6. Index construction time of FELINE and FELINE\(^+\)

5.2 Performance of FELINE\(^+\)

As discussed in Sect. 3.1, we propose an optimized approach named FELINE\(^+\) to accelerate FELINE index generation, while obtaining exactly the same index as FELINE. Figure 6 shows the index construction time, in which FELINE\(^+\) significantly speeds up the construction process in all graphs.

5.3 Queries with Different k

The efficiency of online querying is crucial for k-hop reachability query answering, and different values of k can significantly affect the performance. We report the query time of the proposed ESTI method with different values of k (\(k=2, 4, 8\)) in Table 2, comparing it with the state-of-the-art k-reach approach [2]. For each k, we generate a million queries with randomly selected start and target vertices. Note that k-reach requires a fixed budget b to construct the partial vertex cover and we set \(b=1000\), which is the same as the budget used in [2].

When the value of k increases, the time cost of both k-reach and ESTI also tends to increase, because a larger k indicates a larger search space when the built index cannot directly answer a query. We notice that most queries fall into the worst case in k-reach, which needs a traditional BFS search over the whole graph. Note that when \(k=4\) and \(k=8\), k-reach exceeds our time limit (4 h) on graph soc-Pokec. Clearly, ESTI is faster than k-reach on all graphs when \(k=2\) and \(k=4\), and when \(k=8\) it also beats k-reach on most graphs, with WikiTalk as an exception. Note that the diameter of WikiTalk is 9, which is relatively small and quite close to \(k=8\). In practice, k will not be too large for social networks.

Table 2. Query time (ms) of different k

5.4 Comparison with the State-of-the-Art

As discussed in Sect. 1, k-reach [1, 2] is the only method solving k-hop reachability queries on general directed graphs. We conduct experiments on both small and large graphs to compare the proposed ESTI method with k-reach. For each graph, we randomly generate a million queries, with the values of k generated following the distance distribution of all reachable pairs. The index size, index construction time and query time are reported in Tables 3 and 4.

Table 3. Index size, index construction time and query time on small graphs
Table 4. Index size, index construction time and query time on large graphs

The results in Table 3 show that ESTI outperforms k-reach on all small graphs. Note that the budget of k-reach is again set to 1000. ESTI constructs a smaller index and is approximately an order of magnitude faster when building the index for most small graphs. As for online querying, ESTI costs significantly less time. It is even more than a hundred times faster on graph soc-Epinions1.

For large graphs, we compare our ESTI method with k-reach in Table 4, where the budget of k-reach is set to 1,000 and 50,000, respectively. Note that k-reach exceeds our time limit (4 h) on graph soc-Pokec. When answering queries online, the ESTI method costs much less time on all large graphs. Though ESTI needs longer index construction time on most graphs, we believe that the efficiency of online query processing is more important than that of offline indexing. Theoretically, the overall time cost of ESTI offline indexing is \(O(|V|log(Deg^{(out)}_m)+|E|log(Deg^{(in)}_m))\), which is a stable bound.

The index size of ESTI is \(O(|V|+|E|)\), which is strictly linear in the size of the input graph. In contrast, k-reach with budget 1,000 has the smallest index size on some large graphs, but it costs a lot of time to answer queries online. It seems that 1,000 is a relatively small budget, which may limit the querying performance of k-reach. When the budget is set to 50,000, however, k-reach has a larger index size than ESTI on many graphs, while it still costs more time in the online querying process. Hence, the overall query answering ability of the ESTI method is also better on large graphs.

6 Related Works

6.1 Reachability Query

Before Cheng et al. [1] first proposed the k-hop reachability problem, many studies on reachability queries over large graphs had been carried out. A reachability query is the special case of a k-hop reachability query with \(k=\infty \). Because they lack distance information, existing reachability query methods, including BFL [8], IP+ [11], GRIPP [9], PWAH8 [6], GRAIL [13] and Path-Tree Cover [4], are not sufficient to answer k-hop reachability queries.

6.2 k-hop Reachability Query

To answer k-hop reachability queries, a naive idea is to run BFS or DFS on the given directed graph. Neither BFS nor DFS needs a pre-computed index, but both become inefficient when the graph is very large, since many search branches are expanded while exploring the original graph. In contrast, storing the shortest distance between each pair of vertices allows any query to be answered in O(1) time. However, computing and storing these distances by performing BFS from every vertex in G costs \(O(|V|(|V|+|E|))\) time and \(O(|V|^2)\) space, which is also inefficient and even infeasible for large graphs.
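
The two baselines above can be sketched as follows. This is a minimal illustration, assuming the graph is given as a dictionary `adj` mapping each vertex to a list of out-neighbours; the function names `k_hop_reachable` and `all_pairs_distances` are hypothetical and do not come from any of the cited systems.

```python
from collections import deque


def k_hop_reachable(adj, u, v, k):
    """Index-free baseline: BFS from u that stops expanding at depth k."""
    if u == v:
        return True
    visited = {u}
    frontier = deque([(u, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == k:
            continue  # vertices at depth k are not expanded further
        for nxt in adj.get(node, ()):
            if nxt == v:
                return True
            if nxt not in visited:
                visited.add(nxt)
                frontier.append((nxt, dist + 1))
    return False


def all_pairs_distances(adj, vertices):
    """Full-index baseline: one BFS per vertex, O(|V|(|V|+|E|)) time."""
    table = {}
    for s in vertices:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            x = queue.popleft()
            for y in adj.get(x, ()):
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
        table[s] = dist  # only reachable targets are stored
    return table


def query_with_table(table, u, v, k):
    """Constant-time lookup once the distance table has been built."""
    return table[u].get(v, float("inf")) <= k
```

The first function pays the full search cost at query time, while the second moves that cost offline in exchange for a table that can grow quadratically with the number of vertices.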

Vertex Cover Based Method. A vertex cover is a subset of the vertices of a given graph G such that, for each edge in G, at least one of its two endpoints is contained in the subset. k-reach [1, 2] makes good use of the vertex cover, and runs BFS in the subgraph constructed from the vertex cover to build its index. Though it has proved efficient on small graphs, k-reach still incurs infeasible indexing time and space on larger graphs.
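
To make the vertex cover definition concrete, the following sketch computes a cover with the standard greedy 2-approximation (repeatedly taking both endpoints of an uncovered edge) and checks the covering property. It is an illustration only: the function names are hypothetical, and the text does not state which cover construction k-reach uses, so this should not be read as its implementation.

```python
def greedy_vertex_cover(edges):
    """Greedy 2-approximation: add both endpoints of each uncovered edge."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.add(u)
            cover.add(v)
    return cover


def is_vertex_cover(edges, cover):
    """Check that every edge has at least one endpoint in the cover."""
    return all(u in cover or v in cover for u, v in edges)
```

Because every edge touches the cover, any path between two vertices passes through cover vertices at every other step at least, which is the property a cover-based index can exploit.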

To overcome this drawback, Cheng et al. also propose a partial vertex cover [2] to trade off offline indexing cost against online query performance. Though it can work on very large graphs, the partial vertex cover index cannot answer a large proportion of online queries directly. In fact, traditional online BFS would be invoked for more than \(95\%\) of the queries. Hence, it is still not practical enough for answering k-hop reachability queries efficiently.

Methods Working on DAGs. To improve index efficiency, Xie et al. [12] proposed the BFSI-B algorithm, which uses the breadth-first spanning tree to build the BFSI index, including min-post index and global BFS level TLE. The FELINE index [10] is also adopted to filter unreachable queries. Another method developed for DAGs is HT [3], which adopts the idea of a partial 2-hop cover. In its indexing process, vertices with high degree are selected as hop nodes, and both a backward and a forward BFS are started from each hop node u. When a new vertex v is visited, the id of the current hop node u and the distance between u and v are stored in the index of v. Topological order is also used for filtering unreachable queries.
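
The hop-node idea described above can be sketched as follows. This is a simplified illustration and not the HT implementation: the names `build_hop_index` and `certified_within_k`, the use of total degree to rank vertices, and the parameter `num_hops` are all assumptions. Because the cover is partial, the lookup can only certify a positive answer; a `False` result means "not certified" and a real system would need a fallback search.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest hop distances from source to every vertex it reaches."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        x = queue.popleft()
        for y in adj.get(x, ()):
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist


def build_hop_index(adj, num_hops):
    """Label vertices with distances to and from the highest-degree vertices."""
    radj, degree = {}, {}
    for x, outs in adj.items():
        degree[x] = degree.get(x, 0) + len(outs)
        for y in outs:
            radj.setdefault(y, []).append(x)  # reversed edge for backward BFS
            degree[y] = degree.get(y, 0) + 1
    hops = sorted(degree, key=lambda x: (-degree[x], str(x)))[:num_hops]
    to_hop = {x: {} for x in degree}    # to_hop[v][h] = dist(v -> h)
    from_hop = {x: {} for x in degree}  # from_hop[v][h] = dist(h -> v)
    for h in hops:
        for x, d in bfs_distances(adj, h).items():   # forward BFS
            from_hop[x][h] = d
        for x, d in bfs_distances(radj, h).items():  # backward BFS
            to_hop[x][h] = d
    return to_hop, from_hop


def certified_within_k(to_hop, from_hop, u, v, k):
    """True if some hop node h gives dist(u, h) + dist(h, v) <= k."""
    best = min(
        (d + from_hop[v][h] for h, d in to_hop[u].items() if h in from_hop[v]),
        default=None,
    )
    return best is not None and best <= k
```

Selecting only a few high-degree hop nodes keeps the labels small, at the price of leaving pairs whose short paths avoid every hop node unanswered by the index alone.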

Though both BFSI-B and HT are more efficient than k-reach, they work only on DAGs and cannot directly handle directed graphs with cycles. In addition, more efficient pruning strategies are needed to further improve online querying performance.

Algorithms for Distributed Systems. To deal with multiple k-hop reachability queries concurrently on distributed infrastructures, C-Graph [14] focuses on improving both disk and network I/O performance when performing BFS. Compared with developing methods for a single machine, designing optimizations for distributed systems is a significantly different task.

7 Conclusion

We propose the ESTI method to efficiently solve k-hop reachability queries on general directed graphs. It builds an extended spanning tree in the offline phase and utilizes three pruning strategies to accelerate query processing. In addition, an optimization named FELINE\(^+\) is developed to speed up FELINE index generation, which helps to effectively filter unreachable queries in online searching. We also conduct extensive experiments to compare ESTI with the state-of-the-art method k-reach. Our experimental results confirm that on most graphs the overall performance of ESTI is the best, and that it is significantly faster in online querying.