1 Introduction

1.1 Background

Graph pattern matching (GPM) identifies the matching subgraphs M(GP,GD) in a data graph GD for a given pattern graph GP. It is increasingly used in computer vision [4], chemical structure [23] and social networks [20]. GPM is typically defined in terms of subgraph isomorphism [30]. That is, a graph GD is a match of GP if there exists a bijective function f from the nodes of GP to the nodes of GD such that (a) for each node v in GP, v and f(v) have the same label, and (b) there exists an edge from v to \(v^{\prime }\) in GP if and only if \((f(v),f(v^{\prime }))\) is an edge in GD. Under this definition graph pattern matching is NP-complete [15], and the definition is often too restrictive for practical applications. Therefore, graph simulation [22] has been proposed with fewer restrictions. A data graph GD is a match of GP if there exists a binary relation \(R\subseteq V_{q} \times V\) such that (1) for each (u,v) ∈ R, u and v have the same label; and (2) for each edge \((u, u^{\prime })\) in GP, there exists an edge \((v,v^{\prime })\) in GD such that \((u^{\prime }, v^{\prime }) \in R\). In contrast to subgraph isomorphism, graph simulation [22] imposes fewer restrictions, and it can extract more useful subgraphs with better efficiency.
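The two simulation conditions above can be illustrated with a small fixpoint computation. The following Python sketch is not taken from [22]; it is a minimal illustration under the assumption that each node carries a single label, and the function name and the dictionary-based graph encoding are hypothetical.

```python
def graph_simulation(pattern_edges, pattern_labels, data_edges, data_labels):
    """Return {pattern node: set of data nodes} for the largest relation
    that satisfies the two simulation conditions (illustrative sketch)."""
    # Successor lists of the data graph.
    succ = {}
    for a, b in data_edges:
        succ.setdefault(a, set()).add(b)

    # Condition (1): start from all label-compatible pairs.
    sim = {
        u: {v for v, lab in data_labels.items() if lab == pattern_labels[u]}
        for u in pattern_labels
    }

    # Condition (2): repeatedly drop data nodes that lack a required successor.
    changed = True
    while changed:
        changed = False
        for u, u_child in pattern_edges:
            keep = {v for v in sim[u] if succ.get(v, set()) & sim[u_child]}
            if keep != sim[u]:
                sim[u] = keep
                changed = True
    return sim
```

If some pattern node ends up with an empty set, the data graph does not match the pattern under simulation.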

Example 1

Consider GD and GP in Figure 1, where each node denotes a person with a role, e.g., project manager (PM), database developer (DB) or programmer (PRG). Moreover, each edge indicates a supervision relationship, e.g., edge (PM1, PRG1) indicates that PM1 supervises PRG1. When graph pattern matching is defined in terms of subgraph isomorphism, GD1 is a match of GP. However, when graph pattern matching is defined in terms of graph simulation, GD1, GD2 and GD3 are all matches of GP.

Figure 1

Data graph GD and Pattern graph GP

Graph simulation aims to extract all matching subgraphs M(GP,GD) from GD. However, many applications, such as expert recommendation [21, 24] and egocentric search [5], only need the Top-K matches of a designated node vd with respect to a given GP, rather than the entire matching set.

Example 2

Recall GD and GP in Figure 1. Suppose a company issues a graph search query to find the matches of PM in GD based on the pattern GP, where PM needs to supervise both a DB and a PRG, and moreover, the DB needs to work under the supervision of a PRG. The requirement for the matches of PM is expressed as the graph pattern GP. When graph pattern matching is defined in terms of graph simulation in GD, the matches of the designated node can be identified, namely PM1, PM2 and PM3 in GD1, GD2 and GD3.

For the query in the above example, it is unnecessary and too costly to compute the entire matching set M(GP,GD). Thus, Top-K GPM based on graph simulation (TopKP) has been proposed [10]: given a pattern graph GP, a data graph GD and a designated node vd in GP, it finds the Top-K matching nodes of vd in GD, ranked by a quality function.

1.2 Problem and challenges

As shown in [17], in many applications, for example crowd-sourcing travel [19] and social network based e-commerce [15], people are willing to incorporate social contexts, such as social relationships, social trust and social positions, into the finding of vd, since these contexts have significant influence on people’s collaborations and decision making [14]. For example, in expert recommendation in social graphs, people prefer to find an expert who has intimate relationships with the members of the team led by that expert.

However, existing Top-K designated node finding methods, such as TopKP [10], only take the number of nodes included in a matching into consideration: the more nodes in a matching, the better the quality of vd. This strategy neglects the influence of social contexts, and thus can hardly find high-quality matches of vd in social graphs.

Example 3

Consider GD and GP in Figure 2, and suppose PM3 is the father of DB1 and PM2 is a classmate of DB1. TopKP [10] will regard PM2 as the best match of the designated node PM in GP, as the corresponding matching g2 has 4 nodes besides PM2. However, the relationship between PM3 and DB1 is more intimate than that between PM2 and DB1. Thus, PM2 may not be the best choice in designated node finding.

Figure 2

Designated node finding in terms of graph simulation

This example inspires a new type of conText-Aware Graph pattern based Top-K designated nodes finding (TAG-K) problem, where we need to consider social contexts, i.e., social trust, social relationships and the impact of social position. Based on the theories in Social Science [1], these social contexts have significant influence on people’s decision making. In real applications, people are willing to incorporate the requirements of these social contexts in graph pattern matching, forming the Multi-Constrained Simulation (MCS) [17]. GPM in terms of MCS is NP-Complete as it subsumes the classical NP-Complete multi-constrained path selection problem [15]. Thus, TAG-K, which aims to find the Top-K designated nodes in terms of MCS, raises challenges in both efficiency and effectiveness.

In this paper, we propose the TAG-K problem: given a pattern graph GP with multiple constraints on social contexts, a data graph GD that contains social contexts, and a designated node vd in GP, find the Top-K matches of vd included in the graph pattern matching results M(GP, GD) in terms of MCS. TAG-K covers the multi-constrained GPM problem, which is NP-Complete [17]. So, the main challenge of our work is to develop an efficient and effective approximation method to support the TAG-K query. Our contributions are summarized below.

  • We first propose a TAG-K problem, where, in the graph pattern based designated node finding, TAG-K considers the constraints of social contexts, like social trust, social relationships and the impact of social position.

  • We then propose a Multi-Attribute Tree (MA-Tree) index to record the labels, outdegree and indegree of nodes in GD, which can retrieve the candidates of vd efficiently. Moreover, we propose an SSC-Index, which records more details of the descendants of a node in a Strong Social Component, where the nodes have strong social relations and high impact of social position, so that unpromising nodes can be pruned effectively.

  • We develop an efficient and effective algorithm, called A-TAG-K, which incorporates MA-Tree and SSC-Index. A-TAG-K can deliver a set containing the Top-K designated nodes with \(O(E_{P}{N^{2}_{D}} + E_{P}N_{D})\) time complexity, where ND is the number of nodes in GD and EP is the number of edges in GP.

  • We conduct experiments on five real social network datasets, and the experimental results demonstrate that our A-TAG-K greatly outperforms the existing methods in both efficiency and effectiveness.

The rest of this paper is structured as follows. In Section 2, we present related work on the Top-K GPM problem. Section 3 gives the preliminaries of our work. Section 4 presents two indices, MA-Tree and SSC-Index. Section 5 presents the A-TAG-K algorithm. Experimental results are described in Section 6. Finally, we conclude the paper in Section 7.

2 Related work

The Top-K GPM problem has been widely studied in the literature. Existing work can be classified into (1) isomorphism-based Top-K GPM and (2) simulation-based Top-K GPM. Below we analyze them in detail.

Isomorphism-based Top-K GPM

This type of Top-K GPM is based on subgraph isomorphism [30]. Tian et al. [29] propose the concept of approximate subgraph matching, which allows node mismatches and node/edge insertions and deletions. To cope with the approximate subgraph matching problem, an index-based method called TALE is presented in [29]. In addition, Ding et al. [7] define the matching similarity between a data graph and a query graph to order the results. Building on the NH-index in [29], Ding et al. [7] employ the index to prune unpromising candidate nodes for each query node. Furthermore, Zhu et al. [35] consider entire structure matching rather than substructure matching, and propose an algorithm to answer the Top-K graph similarity query using two distance lower bounds with different computational costs. Typical hashing methods [25, 27, 31, 34] can improve the efficiency of GPM, but they sometimes suffer from low effectiveness because some important features of the original data graphs are missed.

Simulation-based Top-K GPM

Existing isomorphism-based Top-K GPM methods are still too strict to be used in some applications, e.g., finding social experts [11] and project organization [10]. Based on graph simulation [22], Fan et al. [10] propose a novel Top-K graph pattern matching method supporting a designated pattern node vd, which can find the Top-K matches of vd without computing the entire graph matching results. In addition, Chang et al. [3] study the problem of Top-K tree pattern matching, where the edges in the tree are mapped to the shortest paths in G connecting the corresponding nodes. A novel and optimal enumeration paradigm [3] has been presented, which is based on the principle of Lawler’s procedure. Furthermore, Gao et al. [26] propose a graph learning method based on graph simulation, where multiple features of a graph are considered.

The existing Top-K GPM methods do not consider the social contexts in GPM in social graphs. As indicated in [17], such queries are common in social network based applications, like crowd-sourcing travel [19] and social network based e-commerce [15], which motivates us to develop a new type of context-aware graph pattern based designated nodes finding method.

3 Preliminaries

In this section, we first introduce the Multi-Constrained Simulation (MCS) [15], and then we define the TAG-K problem, and propose the ranking function in Top-K nodes finding.

3.1 Multi-constrained simulation (MCS)

Given a data graph GD and a pattern graph GP, MCS [15] provides the conditions that a subgraph of GD must satisfy in order to match the pattern graph.

Data graph

A data graph is a Contextual Social Graph (CSG) [15], which is a labeled directed graph G = (V,E,LV,LE), where

  • V is a set of vertices;

  • E is a set of edges, and (ui,uj) ∈ E denotes a directed edge from vertex ui to vertex uj;

  • LV is a function defined on V such that for each vertex u in V, LV (u) is a set of labels for u. Intuitively, the vertex labels may, for example, represent social roles in a specific domain;

  • LE is a function defined on E such that for each edge (ui,uj) in E, LE(ui,uj) is a set of labels for (ui,uj), like social relationships and social trust in a specific domain.

Example 4

As shown in the data graph in Figure 3b, each vertex is associated with a role impact factor, denoted as ρ ∈ [0,1], which illustrates the impact of a participant in a specific domain and is determined by the expertise of the participant. ρ = 1 indicates that the person is a domain expert, while ρ = 0 indicates that the person has no knowledge in that domain. Moreover, each edge is associated with social trust, denoted as T ∈ [0,1], and social intimacy degree, denoted as r ∈ [0,1], which illustrate the trust and intimacy relationships between participants. T, r and ρ are called social impact factors, whose values can be extracted by using data mining techniques [13, 15, 16, 18, 33]. For example, in academic social networks formed by large databases of Computer Science literature (e.g., DBLP or ACM Digital Library), the social relationships between two scholars (e.g., co-authors, a supervisor and his/her students) and the role of scholars (e.g., a professor in the field of data mining) can be mined from publications or their homepages. The social intimacy degree and role impact factor values can be calculated, for example, by applying the PageRank model [28].

Figure 3

Data graph GD and pattern graph GP

Based on theories in Social Psychology [1], we adopt the multiplication method to aggregate the T and r values of a path, and adopt the average method to aggregate the ρ values of the vertices in a path. The details of the aggregation method have been discussed in [15]. The aggregated values of a path p are denoted as AS(p) = < AT(p), Ar(p), Aρ(p) >. If each of the aggregated social impact factor values of p is greater than the corresponding one of path \(p^{\prime }\), then p dominates \(p^{\prime }\), which is denoted as \(p\propto p^{\prime }\).
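As an illustration of this aggregation, the following Python sketch multiplies the T and r values along the edges of a path, averages the ρ values of its vertices, and tests dominance. The function names and the list-based path encoding are assumptions made for this example; the exact aggregation in [15] may differ in detail.

```python
from math import prod


def aggregate_path(trust, intimacy, role_impact):
    """Aggregate one path into (A_T, A_r, A_rho).

    trust, intimacy: per-edge T and r values along the path.
    role_impact: per-vertex rho values of the vertices on the path.
    """
    return (prod(trust), prod(intimacy), sum(role_impact) / len(role_impact))


def dominates(agg_p, agg_q):
    """True if every aggregated value of p is greater than that of q."""
    return all(a > b for a, b in zip(agg_p, agg_q))
```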

Pattern graph

A pattern graph is defined as GP = (Vp, Ep, fv, fe, se), where (1) Vp and Ep are the set of vertices and the set of directed edges, respectively, and fv(v) is the node label of v; (2) fe(vi,vj) is the bounded length of (vi,vj), represented by L; and (3) se(vi,vj) is the set of multiple constraints on the aggregated social impact factor values of (vi,vj), represented by λT, λr and λρ, which are in the scope of [0,1].

Example 5

As shown in the pattern graph in Figure 3a, the multiple constraints, i.e., λT, λr, λρ and L, are given on edge (PM,PRG), and they must be satisfied in the graph pattern matching.

GPM based on multi-constrained simulation (MCS)

GD matches GP via MCS if there exists a binary relation \(S \subseteq V_{q} \times V\) such that (1) for all \(v \in V_{q}\), there exists a node \(u \in V\) such that (v,u) ∈ S; (2) for each edge \((v,v^{\prime })\) in Eq, there exists a nonempty path from u to \(u^{\prime }\) in G with \(length(u,u^{\prime }) \leq L\), if \(f_{e}(v, v^{\prime }) = L\); and (3) \(T(u,u^{\prime }) > \lambda _{T}\), \(r(u,u^{\prime }) > \lambda _{r}\) and \(\rho (u,u^{\prime }) > \lambda _{\rho }\), if \(s_{e}(v, v^{\prime }) = \{\lambda _{T}, \lambda _{r}, \lambda _{\rho }\}\).
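The per-edge part of these conditions can be checked on a single candidate path as in the sketch below. It assumes that T and r are aggregated by multiplication and ρ by averaging, as described for the data graph; the function name and parameters are hypothetical and serve only as an illustration.

```python
from math import prod


def path_satisfies(trust, intimacy, role_impact, max_len, lam_T, lam_r, lam_rho):
    """Check one data-graph path against the constraints of a pattern edge.

    trust, intimacy: per-edge T and r values of the path.
    role_impact: per-vertex rho values of the path.
    max_len: bounded length L; lam_T, lam_r, lam_rho: thresholds.
    """
    length = len(trust)  # number of edges on the path
    if length == 0 or length > max_len:
        return False  # the path must be nonempty and within the bound
    avg_rho = sum(role_impact) / len(role_impact)
    return prod(trust) > lam_T and prod(intimacy) > lam_r and avg_rho > lam_rho
```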

If GD matches GP, then there exists a unique maximum relation M(GP,GD). If GD does not match GP, M(GP,GD) is the empty set. This maximum relation M(GP,GD) is referred to as the set of matches of GP in GD.

Example 6

Figure 4 displays two matching subgraphs in terms of MCS based on GP and GD in Figure 3, where the multiple constraints, i.e., λT,λr, λρ and L are satisfied in the two graph pattern matchings.

Figure 4

The matching subgraphs of GP in terms of MCS

3.2 The matches of the designated node vd

Given a pattern graph GP, a data graph GD, and a designated node vd, If the nodes in GD can match the designated node vd, matching subgraphs in GD containing these nodes must match GP based on MCS. We denote the matches of vd in GD based on GP as Mu(vd,GP,GD), and denote the matching of ui to vd as uivd.

Example 7

As shown in Figure 4c, the two matching subgraphs can match GP in Figure 3 in terms of MCS. Thus PM2 and PM3 are two matching nodes of PM. Namely, \(M_{PM_{2}}(PM, G_{P},G_{D})=\{DB_{1}, PRG_{2}\}\), \(M_{PM_{3}}(PM, G_{P},G_{D})=\{PRG_{2}, BA, DB_{3}\}\) and PM2PM and PM3PM.

3.3 Ranking of matching

As the number of nodes included in the matching result and the social impact factor values have significant influence on the quality of the pattern matching [10, 15], we propose the following ranking functions to rank the delivered pattern matching results and identify the Top-K designated nodes.

The relevant set

For each descendant \(v^{\prime }\) of vd in GP, the relevant set, \(R_{(u, v_{d})}\), includes all matches \(u^{\prime }\) of \(v^{\prime }\), such that if vd reaches \(v^{\prime }\) via a path \((v_{d}, v_{1},...,v_{n}, v^{\prime })\), then u reaches \(u^{\prime }\) via \((u, u_{1},...,u_{n}, u^{\prime })\), where (ui,uj) ∈ M(GP,GD).

That is, \(R_{(u, v_{d})}\) includes all nodes of the matching in Mu(vd,GP,GD). The larger the \(R_{(u, v_{d})}\), the better the matching [12].

Example 8

Based on Figure 3, \(|R_{(PM_{1}, PM)}| = 3\) and \(|R_{(PM_{2}, PM)}| = |R_{(PM_{3},PM)}| = 2\).

Based on the utility function in (1) [15], we propose the new utility function as (2) that considers the average impact of social contexts of each matching path from ui to uj in the pattern matching, Mu(vd,GP,GD).

Utility function

$$ U(p)=w_{T}*A_{T}(p)+w_{r}*A_{r}(p)+w_{\rho}*A_{\rho}(p) $$
(1)

where wT, wr and wρ are the weights of AT(p), Ar(p) and Aρ(p) respectively; 0 < wT, wr, wρ < 1 and wT + wr + wρ = 1.

The values of these weights can be specified by users to reflect their requirements in different applications. For example, in crowdsourcing travel, a user could give a high value to wT to express the concern about the social relationship between two people, while in employment, a user could give a high value to wρ to express the concern about the social impact of a person.

$$ \delta(u, v_{d}) = \frac{{\sum}_{u_{i} \in R_{(u, v_{d})},u_{j} \in R_{(u, v_{d})}}^{} U(p(u_{i}, u_{j}))}{N} $$
(2)

where N is the number of matching paths in Mu(vd,GP,GD).
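Equations (1) and (2) can be evaluated as in the following sketch. It assumes that the aggregated values (AT, Ar, Aρ) of every matching path have already been computed; the function names are hypothetical, and returning 0 for an empty set of paths is a convention chosen for this example.

```python
def utility(agg, w_T, w_r, w_rho):
    """Equation (1): weighted sum of the aggregated values of one path."""
    A_T, A_r, A_rho = agg
    return w_T * A_T + w_r * A_r + w_rho * A_rho


def delta(path_aggs, w_T, w_r, w_rho):
    """Equation (2): average utility over the N matching paths.

    path_aggs: list of (A_T, A_r, A_rho) tuples, one per matching path.
    Returns 0.0 when there is no matching path (assumed convention).
    """
    if not path_aggs:
        return 0.0
    total = sum(utility(agg, w_T, w_r, w_rho) for agg in path_aggs)
    return total / len(path_aggs)
```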

Ranking function

Based on the relevant set and the utility function, we propose the ranking function RF(u,vd), on a match u of vd as a bi-criteria objective function in (3), to capture the influences of both the number of matching nodes and the social contexts.

The value of the relevance function lies in [0,ND] (ND is the number of nodes in GD), while the value of the utility function lies in [0,1]. Thus, we need to normalize RF(u,vd) to the scope of [0,1]. In the literature, there are several methods for normalization, such as the log function [6], the min-max normalization [6] and the arctan() function [6]. As the range of the relevance function is [0,ND], the log function [6] and the min-max normalization [6] cannot be applied. For x ∈ [0,ND], the value of arctan(x) is in the scope of [0,π/2]. Thus, we use the arctan() function to normalize RF(u,vd) into the scope of [0,1]. Here, α is used to adjust the weight between the relevance function and the utility function.

$$ RF (u,v_{d}) = \alpha * arctan(R(u, v_{d}))*2/\pi + (1-\alpha )*\delta{(u, v_{d})} $$
(3)
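The ranking function can be sketched in Python as follows. The sketch assumes that the factor 2/π scales the value of arctan, so that the relevance term stays within [0,1) as the normalization argument requires, and that R(u,vd) denotes the size of the relevant set; the function name is hypothetical.

```python
import math


def ranking_score(relevant_size, delta_value, alpha):
    """Bi-criteria ranking of a match u of v_d.

    relevant_size: size of the relevant set of u (non-negative).
    delta_value: average path utility from equation (2), in [0, 1].
    alpha: weight between relevance and utility, in [0, 1].
    """
    # Assumption: 2/pi multiplies arctan(.), mapping [0, inf) into [0, 1).
    relevance = (2.0 / math.pi) * math.atan(relevant_size)
    return alpha * relevance + (1.0 - alpha) * delta_value
```

With α = 1 only the size of the relevant set matters, and with α = 0 only the social contexts do.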

3.4 TAG-K problem

Based on the ranking function RF(u,vd), we propose the conText-Aware Graph pattern based Top-K designated nodes finding problem (TAG-K for short). Given a contextual social graph as the data graph GD, a pattern graph GP, a designated node vd in GP, and a positive integer K, TAG-K is to identify the Top-K matches \({M_{u}^{K}}\) \(({M_{u}^{K}}\subseteq M_{u}(v_{d}, G_{P}, G_{D}))\), such that

$$ \underset{{M_{u}^{K}}\subseteq M_{u}(v_{d}, G_{P},G_{D})}{argmax}{\sum\limits_{u_{i}\in {M_{u}^{K}} }^{}}RF(u_{i}, v_{d}) $$
(4)

That is, TAG-K is to identify a set of K matches of vd that maximizes the ranking function \(RF({M_{u}^{K}})\). Namely, \(\forall M^{*}\subseteq M_{u}(v_{d}, G_{P}, G_{D})\), if \(|M^{*}| = K\), then \(RF({M_{u}^{K}}) \geqslant RF(M^{*})\).

Example 9

In Figure 3, suppose there is a requirement of finding the best match of the designated node PM, and the relevant set and the utility function have the same weight in the ranking function. Thus, based on the social impact factor values in GD and (3), we can get RF(PM2,PM) = 0.69, RF(PM3,PM) = 0.57 and RF(PM1,PM) = 0.55. Thus, PM2 is the best match of PM via GP in Figure 3.
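Because the objective in (4) is a sum of per-node scores, it is maximized by the K matches with the largest RF values once all scores are known. The following sketch shows this exhaustive selection on the scores of Example 9; the function name is hypothetical, and this is not the approximation algorithm proposed in this paper.

```python
import heapq


def top_k_matches(scores, k):
    """Return the k matches with the largest ranking values.

    scores: {match node: RF(node, v_d)} for all known matches of v_d.
    """
    return heapq.nlargest(k, scores, key=scores.get)


# Ranking values taken from Example 9.
example_scores = {"PM1": 0.55, "PM2": 0.69, "PM3": 0.57}
best = top_k_matches(example_scores, 1)
```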

4 Indexing structure

To improve the efficiency of TAG-K finding, we propose two index structures, i.e., Multi-Attributes Tree index, (MA-Tree), and Strong Social Component index, (SSC-Index), to record the labels, indegree, outdegree, the shortest path and the aggregated social impact factor values, which can help efficiently find the candidates of vd in GD.

4.1 MA-Tree

4.1.1 The purpose of the index

Based on GPM in terms of MCS, if uvd, (1) the label of u must be the same as that of vd, and (2) the outdegree/indegree of u must be greater than 0, if the outdegree and/or the indegree of vd is/are greater than 0. Thus, in order to investigate the potential candidates of vd in GD, based on the B+ tree, we propose a Multi-Attributes Tree (MA-Tree) index to record the multiple attributes, including the label, indegree and outdegree, of each node in GD.

4.1.2 Structure

In MA-Tree, (1) each leaf node contains a pointer, which points to an array that saves the nodes with the same label, indegree and outdegree, and (2) each non-leaf node records a tuple (category number, indegree, outdegree), where each category is coded by a digit as the category number, and the tree structure is established based on the values of the category number, the indegree and the outdegree of each node, respectively. Similar to the B+ tree, the MA-Tree index can effectively reduce the search space when finding the candidates of vd, and thus can improve the efficiency of TAG-K designated node finding.

Example 10

Figure 5 is an MA-Tree of GD in Figure 3, where the digits 0, 1, 2, 3 and 4 represent the five categories PM, DB, PRG, BA and ST, respectively. As the indegree of PM1 is 0 and the outdegree of PM1 is 2, the tuple (0,0,2) is inserted into the MA-Tree. Based on the comparison of the category numbers, indegrees and outdegrees, we can establish the corresponding MA-Tree of GD shown in Figure 5.

Figure 5

The MA-Tree of GD in Figure 3

4.1.3 The searching process

Based on the MA-Tree structure, we can quickly check whether a node is a candidate of vd by searching the MA-Tree from the root node to the leaf nodes. The time complexity of searching the MA-Tree is the same as that of the B+ tree, i.e., \(O(\log_{b} N_{D})\), where ND is the number of nodes in GD and b is the number of children nodes at each level in the MA-Tree.
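The lookup can be illustrated with an ordered key list in place of a real B+ tree. The sketch below keeps the (category number, indegree, outdegree) tuples sorted and uses binary search to locate all keys of one category; the class and method names are hypothetical, and a sorted Python list is only a stand-in that does not reproduce the node layout of an MA-Tree.

```python
from bisect import bisect_left


class MATreeSketch:
    """Ordered index over (category number, indegree, outdegree) keys."""

    def __init__(self):
        self.keys = []      # sorted list of distinct key tuples
        self.buckets = {}   # key tuple -> data nodes sharing that key

    def insert(self, node, category, indegree, outdegree):
        key = (category, indegree, outdegree)
        if key not in self.buckets:
            self.buckets[key] = []
            self.keys.insert(bisect_left(self.keys, key), key)
        self.buckets[key].append(node)

    def candidates(self, category, need_in, need_out):
        """Nodes of the given category; need_in / need_out state whether
        v_d has a positive indegree / outdegree that u must also have."""
        lo = bisect_left(self.keys, (category,))
        hi = bisect_left(self.keys, (category + 1,))
        found = []
        for key in self.keys[lo:hi]:
            _, indegree, outdegree = key
            if (need_in and indegree == 0) or (need_out and outdegree == 0):
                continue
            found.extend(self.buckets[key])
        return found
```

For instance, after inserting PM1 with the tuple (0,0,2) from Example 10, a query for category 0 with a required positive outdegree returns PM1.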

4.2 Strong social component index (SSC-Index)

In addition to the information indexed by MA-Tree, we need to investigate whether the descendants of a candidate can be part of a match of GP. Thus, we first propose the concept of a strong social component, where the nodes and edges have large social impact factor values and thus have a high probability of satisfying the social context constraints in GP. We then build the SSC-Index to record the labels, the shortest path length, and the aggregated social impact factor values of the descendants of each node in an SSC, which can be utilized to prune unpromising nodes.

In graph theory [2], a graph G is said to be strongly connected if every vertex is reachable from every other vertex, and a strongly connected component of a directed graph G is a subgraph that is strongly connected. Based on the definition of the strong connection, we give the definition of a Strong Social Component as follows.

Definition 2 Strong Social Component

Given a CSG < V,E,LV,LE >, and two parameters λV and λE with \(0\leqslant \lambda _{V} \leqslant 1 \) and \(0\leqslant \lambda _{E} \leqslant 1 \), the subgraph induced by a subset of the node set \(V^{\prime } \subseteq V\) and edge set \(E^{\prime } \subseteq E\) is an SSC if, and only if, the following two conditions hold:

  • \(\forall v \in V^{\prime }, LV(v) \geqslant \lambda _{V}\)

  • \(\forall e \in E^{\prime }, LE(e)\geqslant \lambda _{E}\)

where \(E^{\prime }=E \bigcap (V^{\prime } \times V^{\prime })\).

In a CSG, a subgraph is said to be socially strongly connected if each vertex associated with a high role impact factor value in a specific domain is connected with the edges associated with intimate social relationships and strong social trust relationships. A Strong Social Component (SSC) is a subgraph that is socially strongly connected.

Based on the theories in Social Psychology [1], in an SSC, the social structure and the social contexts, including the social trust and social relationships on edges and the social roles associated with vertices, usually stay stable over a very long period of time. This property makes it realistic to index and compress the graph in an SSC with a low update cost. Identifying all the SSCs in a specific domain subsumes the classical NP-Complete maximum clique problem [2], which is very time consuming. Alternatively, we can identify up to K SSCs by randomly selecting K vertices that are associated with high role impact factor values as the seeds. Then, from each of the seeds, our algorithm adopts the Breadth-First Search (BFS) method to find the vertices associated with high role impact factor values that are connected by edges associated with high social intimacy degrees and social trust values. In the worst case, our method needs to visit all the vertices and edges in a data graph. The time complexity of the SSC identification is O(NDED), where ND and ED are the numbers of nodes and edges in GD.
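The seed-based identification can be sketched as a BFS that only follows strong edges to strong vertices. The sketch assumes that the single threshold λE is applied to both the trust and the intimacy value of an edge and that λV is applied to the role impact factor; the function name and the dictionary-based graph encoding are hypothetical.

```python
from collections import deque


def grow_ssc(seed, succ, rho, edge_T, edge_r, lam_V, lam_E):
    """Grow one strong social component from a seed vertex by BFS.

    succ: {vertex: iterable of successor vertices}.
    rho: {vertex: role impact factor}.
    edge_T, edge_r: {(u, v): trust / intimacy value of edge (u, v)}.
    """
    if rho[seed] < lam_V:
        return set()  # a weak seed cannot start a component
    component = {seed}
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        for v in succ.get(u, ()):
            if v in component:
                continue
            # Assumption: the same threshold lam_E bounds both T and r.
            strong_edge = edge_T[(u, v)] >= lam_E and edge_r[(u, v)] >= lam_E
            if strong_edge and rho[v] >= lam_V:
                component.add(v)
                queue.append(v)
    return component
```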

4.2.1 The purpose of the index

Based on MCS, if uvd, (1) for each descendant \(v^{\prime }\) of vd, u has a descendant \(u^{\prime }\) that has the same label as \(v^{\prime }\), and (2) for each edge (\(v_{d}, v^{\prime }\)), there exists a path \(p(u, u^{\prime })\) satisfying the constraints of social contexts associated with (\(v_{d}, v^{\prime }\)). Thus, we build the SSC-Index to record the labels, the shortest path length, and the aggregated social impact factor values of the descendants of each node in an SSC.

4.2.2 Index structure

Reachability index

This index records which vertices can reach one another in a graph, where the index of each vertex contains the ancestors and descendants of the vertex. As the size of an SSC is usually much smaller than that of the whole data graph [1], building the reachability index is not computationally expensive [32].

Example 11

Figure 6 gives an example of our index for the graph of an SSC. From the figure, we can see that the index of each vertex includes three parts: the reachability index, the graph pattern index and the social context index. We take vertex E as an example, as it has both ancestors and descendants. The reachability index of E records its ancestor C (i.e., Anc.: C), and its descendant H (i.e., Des.: H). Similarly, we construct the reachability index for each of the other vertices of the graph.

Figure 6

The index of an SSC

Given a reachability query in GP, if the candidate nodes are included in the SSC, we can investigate the reachability immediately, greatly saving query processing time.

Graph pattern index

After indexing the reachability information, we further index the graph pattern information to improve the efficiency of graph pattern matching in designated node finding. This index records the shortest path length between any two nodes in the graph of an SSC.

Example 12

Consider the graph pattern index shown in Figure 6. For vertex E, in addition to indexing the reachability information, the graph pattern index records the shortest path length from its ancestor C to E (i.e., Slen = 1), and from E to its descendant H (i.e., Slen = 1). Similarly, we construct the graph pattern index for each of the other nodes.

Given a query of the graph pattern matching with the bounded length, based on the graph pattern index, we can investigate if the indexed path length is greater than the bounded length, and thus can efficiently find a pattern matching result.

Social context index

In order to improve the efficiency of TAG-K finding, we construct the social context index to record the maximal aggregated social impact factor values of the mapped paths in a data graph. Below are the details of the index.

  • If \(p\propto p^{\prime }\), we index the path length and the corresponding aggregated social impact factor values of p instead of \(p^{\prime }\).

  • Otherwise, we index up to three paths that have the maximal aggregated T, r and ρ values respectively.

Example 13

Consider the social context index shown in Figure 6. Here we take the vertex C as an example, where there are two paths from C to its descendant H, i.e., path p1(C,E,H) and p2(C,F,H). As p1 ∝ p2, we index AS(p1(C,E,H)) = {0.96,0.88,0.91} and its path length Plen(p1(C,E,H)) = 2 at C. Similarly, we construct the social context index for each of the other vertices. Given a graph pattern query with multiple constraints, based on the social context index, we can quickly investigate whether there exists a pattern match in the data graph, thus saving query processing time.
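The two indexing rules can be sketched as a selection over the candidate paths between one pair of vertices. In the sketch below, each path is given as a tuple (AT, Ar, Aρ, length); the function name is hypothetical, and the values of the second path in the usage lines are invented for illustration, since only AS(p1) is given in Example 13.

```python
def paths_to_index(paths):
    """Select the paths whose aggregated values are stored in the index.

    paths: list of (A_T, A_r, A_rho, length) tuples between two vertices.
    """
    if not paths:
        return []
    # Rule 1: a path that dominates every other path is indexed alone.
    for p in paths:
        if all(q is p or all(a > b for a, b in zip(p[:3], q[:3]))
               for q in paths):
            return [p]
    # Rule 2: otherwise keep up to three paths, one with the maximal
    # aggregated T, one with the maximal r and one with the maximal rho.
    chosen = []
    for i in range(3):
        best = max(paths, key=lambda p: p[i])
        if best not in chosen:
            chosen.append(best)
    return chosen


p1 = (0.96, 0.88, 0.91, 2)   # AS(p1(C,E,H)) and Plen from Example 13
p2 = (0.80, 0.70, 0.90, 2)   # hypothetical values for p2(C,F,H)
indexed = paths_to_index([p1, p2])
```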

4.3 Summary

In TAG-K designated node finding, if two connected nodes in GP can be mapped to a path in an SSC, the indexed information can be used to quickly investigate whether there is an edge pattern match, thus greatly saving graph pattern matching time. In addition, in the worst case, we need to perform Dijkstra’s algorithm four times, and thus the time complexity of the index construction is \(O(N_{D}\log N_{D} + E_{D})\), where ND and ED are the numbers of nodes and edges in GD. Furthermore, as mentioned in Section 4.2, the structure and the social contexts of the graph in an SSC usually stay stable over a very long period of time [1]. Therefore, it is usually not necessary to update the indices frequently, which reduces the cost of index maintenance. When there are changes of the social contexts and/or graph structure in an SSC, we can adopt the existing method [9] to first establish the matrices of the shortest path length, the ancestors and descendants, and the aggregated social impact factor values between vertices, and then iteratively investigate the updated SSC graph, finding the affected vertices and edges to update the corresponding matrices. Index maintenance in dynamic graphs is another challenging research topic, and thus it is not discussed in this paper.

5 A-TAG-K: an approximation algorithm for TAG-K

In order to solve the NP-Complete TAG-K designated node finding problem, we propose an approximation algorithm, A-TAG-K, which adopts the Top-K shortest path algorithm [8] to investigate whether there is a path in GD that can match an edge in GP. Our A-TAG-K can deliver a set, \({M_{u}^{s}}(v_{d}, G_{P}, G_{D})\), where \({M^{K}_{u}}(v_{d}, G_{P}, G_{D}) \subseteq {M_{u}^{s}}(v_{d}, G_{P}, G_{D})\), without accessing all the nodes in GD.

5.1 Algorithm overview

A-TAG-K first computes the upper bound of the ranking function of a matching based on all the candidate nodes in the matching. Then A-TAG-K adopts the Top-K shortest path algorithm [8] to investigate the path matching between these candidate nodes, and updates the lower bound of the ranking function based on the investigation. When K elements are found whose minimal lower bound is larger than the maximal upper bound of the other matchings in GD, we deliver the K elements as the TAG-K designated node finding results.

5.2 The bound of TAG-K matching

In order to improve the efficiency of A-TAG-K, we first quickly compute estimated values of RF(u,vd) as its lower bound and upper bound, denoted as RFL(u,vd) and RFU(u,vd) respectively, where \(RF^{L}(u, v_{d}) \leqslant RF(u, v_{d}) \leqslant RF^{U}(u, v_{d})\). If RFL(ui,vd) > RFU(uj,vd), we know that RF(ui,vd) > RF(uj,vd); namely, ui is a better match of the designated node than uj. Thus, after efficiently comparing the lower bound and upper bound of each candidate in GD, A-TAG-K can stop as soon as the Top-K matches are identified, without computing the entire M(GP,GD). Below are the details of the computation of RFL(u,vd) and RFU(u,vd).

Upper bound

Given a candidate u of vd, we use D(u,vd) to denote the set of all descendants of u, where each descendant can match the corresponding label of a node \(v^{\prime }\) in GP. In D(u,vd), we compute RF(D(u,vd),vd) as RFU(u,vd). Some of the paths between nodes in D(u,vd) may not be a match of \((v, v^{\prime })\) in GP; namely, the aggregated social impact factor values of these paths may not satisfy the corresponding constraints in GP. Thus, \(RF(u, v_{d})\leqslant RF^{U}(u, v_{d})\).

Lower bound

Initially, we set RFL(u,vd) = 0. A-TAG-K investigates each pair of descendants in D(u,vd) to find the Top-K shortest paths between the nodes based on the algorithm in [8], and computes the utility of each path by (2). After each investigation, RFL(u,vd) increases if the investigated path between two nodes is a match. During the iterations of the investigation, RFL(u,vd) ≤ RF(u,vd), and after all the iterations, we get RFL(u,vd) = RF(u,vd). Based on the lower bound and upper bound, we have the following Lemma 1.

Lemma 1

A K-element set \({M_{u}^{K}}(v_{d}, G_{P}, G_{D}) \subseteq D (u, v_{d})\) is a set of Top-K matches of vd if (1) each ui in \({M_{u}^{K}}(v_{d}, G_{P}, G_{D})\) is a match of vd, and (2) \(RF_{min}^{L}(u_{i}, v_{d}) \geqslant RF_{max}^{U}(u_{j}, v_{d}), u_{j} \in D (u, v_{d}) \).

Based on Lemma 1, we can terminate early once K elements meeting the corresponding requirements have been found, without computing the entire Mu(vd,GP,GD), which greatly reduces the processing time of TAG-K finding.
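The early termination test of Lemma 1 can be sketched as below. The function name, the dictionary layout and the sample values are hypothetical. Condition (2) is read here as ranging over the candidates outside the selected K-element set, which is an interpretation of the lemma rather than a statement taken from it.

```python
from typing import Dict, List, Optional, Set, Tuple


def early_top_k(
    bounds: Dict[str, Tuple[float, float]],  # candidate -> (RF^L, RF^U)
    confirmed: Set[str],                     # candidates already known to match vd
    k: int,
) -> Optional[List[str]]:
    """Return a Top-K set if the bounds already determine one, else None."""
    matched = sorted(
        (u for u in bounds if u in confirmed),
        key=lambda u: bounds[u][0],
        reverse=True,
    )
    if k <= 0 or len(matched) < k:
        return None
    top = matched[:k]
    # Smallest lower bound inside the set versus largest upper bound outside it.
    min_lower = min(bounds[u][0] for u in top)
    max_upper = max(
        (bounds[u][1] for u in bounds if u not in top),
        default=float("-inf"),
    )
    return top if min_lower >= max_upper else None


# Illustrative values only.
bounds = {"u1": (0.7, 0.9), "u2": (0.6, 0.8), "u3": (0.1, 0.5), "u4": (0.2, 0.55)}
print(early_top_k(bounds, {"u1", "u2"}, k=2))
```

If the function returns None, the bounds are not yet tight enough and the search continues with the next investigation.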

5.2.1 Search procedure

The algorithm A-TAG-K mainly has two stages. The first stage is to compute the upper bound of each candidate node, and the second stage is to compute the lower bound of each candidate node. Below are the details of A-TAG-K.

  1. UpperBound (Algorithm 1). A-TAG-K selects a node u from Vcandidate to update RFU(u,vd). During this process, u is first put into Vtemp. Then, for each adjacent edge \((v_{d}, v^{\prime })\) of the node vd that corresponds to u, A-TAG-K uses BFS to find the set \(V^{\prime }\) of all descendants of u that have the same labels as the children of vd. For each node in \(V^{\prime }\), A-TAG-K computes RFU(u,vd) based on the Top-K shortest path selection algorithm [8] and updates the upper bound. All descendants in \(V^{\prime }\) are then put into Vtemp; the procedure repeatedly takes a node \(u^{\prime }\) from Vtemp and updates the corresponding upper bound. Once Vtemp is empty, the upper bound of u is delivered based on (3).

  2. LowerBound (Algorithm 2). A-TAG-K selects a node \(u^{\prime }\) from Vleafnode, checks whether its parent node u can be a match, and updates the lower bound of u. After each update, A-TAG-K tests the early termination condition and stops if it is satisfied. Because \(u^{\prime }\) is a leaf node, its ranking function can be updated directly; then, if u is a candidate of vd, A-TAG-K calculates the lower bound of u.
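The BFS step of the UpperBound stage, which collects the descendants of u whose labels equal the labels of the children of vd, can be sketched as follows. The adjacency-list representation, the function name and the small graph are assumptions made for illustration; the node names follow the style of Example 1 but the edges are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set


def matching_descendants(
    adj: Dict[str, List[str]],   # data graph GD as adjacency lists
    labels: Dict[str, str],      # node -> label
    u: str,                      # candidate of vd
    child_labels: Set[str],      # labels of the children of vd in GP
) -> Set[str]:
    """Breadth-first search from u, keeping descendants with a wanted label."""
    seen = {u}
    queue = deque([u])
    found: Set[str] = set()
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, []):
            if nxt in seen:
                continue
            seen.add(nxt)
            queue.append(nxt)
            if labels[nxt] in child_labels:
                found.add(nxt)
    return found


# Hypothetical supervision graph.
adj = {"PM1": ["DB1", "PRG1"], "DB1": ["PRG2"], "PRG2": ["PM2"]}
labels = {"PM1": "PM", "PM2": "PM", "DB1": "DB", "PRG1": "PRG", "PRG2": "PRG"}
print(sorted(matching_descendants(adj, labels, "PM1", {"PRG"})))
```

The `seen` set keeps the traversal linear in the number of reachable nodes and edges even when the data graph contains cycles.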

Summary

Our proposed A-TAG-K algorithm is an efficient and effective method for the TAG-K problem in large-scale contextual social graphs. Our method achieves \(O(E_{P}{N^{2}_{D}} + E_{P}N_{D})\) computation cost, where ND is the number of nodes in GD, and EP is the number of edges in GP.

6 Experiments

We conduct experiments on five large-scale real-world social graphs to evaluate (1) the effectiveness of our algorithm in finding TAG-K designated nodes; and (2) the efficiency of our A-TAG-K algorithm.

6.1 Experiment setting

Datasets

We use five real social graphs available at snap.stanford.edu, which have been widely used in the literature for graph pattern matching and social network analysis. The details of these datasets are shown in Table 1.

Table 1 The Social Datasets

Pattern graph and parameter setting

  • We use a popular social network generation tool, SocNetV (socnetv.org, version 2.2), to generate five query graphs. The details of these graphs are shown in Table 2. We randomly select a node from each pattern graph as the designated node vd.

  • A set of relatively low constraints is specified, namely λT = 0.05, λr = 0.05 and λρ = 0.2, to ensure a high probability of returning TAG-K designated nodes in a data graph [17]. Otherwise, no answers or only a few answers might be returned by the algorithms, making it difficult to investigate their performance.

  • As discussed in Section 3, the social context impact factor values (i.e., T, r and ρ) can be mined from existing social networks, which is another very challenging problem but is out of the scope of this work. Moreover, in real cases, the values of these impact factors can vary from low to high without any fixed pattern. Without loss of generality, we randomly set the values of these impact factors by using the function rand() in SQL. In addition, in each of the datasets, the SSC number is set to 20, 40, 60, 80 and 100, respectively; α is set to 0.6, 0.7, 0.8, 0.9 and 1; K is set to 5, 10, 15, 20 and 25; and the maximal bounded path length of the pattern matching is set to 5, 10, 15, 20 and 25.
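The parameter settings listed above can be written down as a small configuration sketch. The dictionary keys, the function name and the sample edges are hypothetical. The sketch also assumes that each impact factor is drawn uniformly from [0, 1) and independently per edge, which mirrors the usual behaviour of a SQL rand() function but is not stated in the text.

```python
import random
from typing import Dict, Iterable, Tuple

Edge = Tuple[str, str]

# Values listed in the text, one list per varied parameter.
PARAMETER_VALUES = {
    "ssc_number": [20, 40, 60, 80, 100],
    "alpha": [0.6, 0.7, 0.8, 0.9, 1.0],
    "K": [5, 10, 15, 20, 25],
    "max_path_length": [5, 10, 15, 20, 25],
}

# Constraint thresholds from the text (lambda_T, lambda_r, lambda_rho).
CONSTRAINTS = {"T": 0.05, "r": 0.05, "rho": 0.2}


def random_impact_factors(
    edges: Iterable[Edge], seed: int = 0
) -> Dict[Edge, Dict[str, float]]:
    """Assumption: every factor is uniform in [0, 1), independent per edge."""
    rng = random.Random(seed)  # seeded so that a run can be repeated
    return {edge: {name: rng.random() for name in CONSTRAINTS} for edge in edges}


# Hypothetical edges, for illustration only.
factors = random_impact_factors([("PM1", "PRG1"), ("PM1", "DB1")])
for edge, values in factors.items():
    print(edge, values)
```

Fixing the seed is a design choice of this sketch; it makes the randomly assigned values reproducible across the compared algorithms.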

Table 2 The pattern graphs

6.2 Implementation

We compare our method with the most promising algorithm for Top-K designated node finding, TopKP [10]. Both A-TAG-K and TopKP are implemented in Java and run on a PC with an Intel Core i5-3470 3.20GHz CPU, 16GB RAM and the Windows 10 operating system. All the experimental results are averaged over five independent runs.
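The averaging over five independent runs can be sketched with a small timing harness. The sketch is written in Python for brevity, whereas the actual implementation is in Java; the function name and the placeholder workload are hypothetical.

```python
import time
from statistics import mean
from typing import Callable, List


def average_query_time(run_query: Callable[[], object], runs: int = 5) -> float:
    """Average wall-clock time in seconds of run_query over `runs` runs."""
    timings: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return mean(timings)


# Placeholder workload standing in for one Top-K query.
calls = []
avg = average_query_time(lambda: calls.append(sum(range(10_000))))
print(f"{len(calls)} runs, average {avg:.6f} s")
```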

6.3 Experimental results and analysis

Exp-1: Effectiveness

This experiment investigates the effectiveness of A-TAG-K by comparing the average ranking function values of the Top-K designated nodes under different parameter settings.

Results

Figures 7, 8, 9 and 10 depict the average ranking function values of the pattern matching results delivered by A-TAG-K and TopKP under different parameter settings. From these figures, we can see that the average ranking function values of the Top-K designated nodes returned by TopKP are always less than those of A-TAG-K. Statistically, on average, A-TAG-K returns answers whose average ranking function value is 72.54% higher than that of TopKP.

Figure 7 The average ranking function value based on different α

Figure 8 The average ranking function values based on different K

Figure 9 The average ranking function value based on different L

Figure 10 The average ranking function value based on different ω

Analysis

The experimental results illustrate that (1) TopKP considers only the number of nodes in designated node finding and does not take the social contexts into consideration; and (2) A-TAG-K delivers the Top-K pattern matching results by considering both the number of nodes and the social contexts, which effectively improves the quality of the query results.

Exp-2: Efficiency

This experiment investigates the efficiency of A-TAG-K by comparing the average query processing time of A-TAG-K and TopKP under different parameter settings.

Results

Figures 11, 12, 13 and 14 depict the average query processing time of A-TAG-K and TopKP in returning different numbers of designated nodes under different parameter settings. From these figures, we can see that A-TAG-K is more efficient than TopKP in Top-K designated node finding in all the cases on the five datasets. Statistically, on average, the query processing time of A-TAG-K is 44.25% less than that of TopKP.

Figure 11 The query time based on different α

Figure 12 The query time based on different K

Figure 13 The query time based on different L

Figure 14 The query time based on different ω

Analysis

The experimental results illustrate that A-TAG-K can efficiently find the candidate nodes of vd by means of the MA-Tree, and can efficiently find the pattern matching by means of the SSC-Index, which avoids visiting all the nodes and edges in a data graph. Thus, A-TAG-K greatly reduces the query processing time.

6.4 Summary

The above experimental results demonstrate that the proposed A-TAG-K algorithm provides an effective means to find context-aware graph based Top-K designated nodes. In addition, with our proposed index structures, A-TAG-K can efficiently find the candidates, which greatly reduces the query processing time. A-TAG-K therefore significantly outperforms the most promising existing algorithm for Top-K designated node finding, TopKP, in both effectiveness and efficiency, and is a very competitive algorithm for the new TAG-K designated node finding problem in social network based applications.

7 Conclusion

In this paper, we have proposed an approximate algorithm, A-TAG-K, to support a new type of context-aware graph pattern based Top-K designated node finding problem, which is a cornerstone of many social network based applications. A-TAG-K achieves a time cost of \(O(E_{P}{N^{2}_{D}} + E_{P}N_{D})\), where ND is the number of nodes in GD and EP is the number of edges in GP. The experiments conducted on five large-scale real-world social graphs have demonstrated the superiority of our proposed approach in terms of effectiveness and efficiency.