1 Introduction

Attributed graphs, a subtype of graphs in which each node carries a set of attributes (as shown in Fig. 1a), have become increasingly popular for modeling web and social networks. Many graph queries have been developed to analyze and retrieve the rich semantic and structural information of these graphs. Among them, subgraph matching, which retrieves all subgraphs isomorphic to a given query graph, is fundamental to many graph data analytics tasks [1]. However, as graph data continue to grow in size, storing and processing them impose expensive upfront infrastructure costs on users. As such, many cloud service providers such as Amazon, Alibaba and Microsoft Azure offer graph outsourcing services by storing graphs owned by users and executing mining tasks on their behalf. GraphLab [2] even provides a graph-based Software-as-a-Service (SaaS).

Fig. 1 Original data graph G and query Q

However, the cloud server is considered "honest-but-curious" (a.k.a. semi-honest), which is consistent with most related works in the literature [3,4,5,6]. On the one hand, the cloud server acts in an "honest" fashion, i.e., it correctly follows the designated protocol specification (such as HIPAA compliance, Footnote 1) and always performs computations correctly without cheating. On the other hand, it is "curious" to infer sensitive information about the graph data by analyzing index structures, requested queries and query results. This assumption does not reduce the need for storing and processing graphs on an untrusted cloud provider. For example, Paysafe, a Fintech company, needs to store and analyze payment data including the attributes and interactions of customers and their payments (which are modeled as a large graph), but it is unwilling to incur the expensive upfront infrastructure costs. Hence, the company resorts to a graph-based cloud service, Graph Database and Graph Analytics (Footnote 2) provided by ORACLE [7], even though the cloud provider is assumed to be "honest-but-curious."

In general, cloud servers can learn the attribute values (a.k.a. labels) and structural information of the outsourced graphs. Privacy breaches on these graphs are known to disclose sensitive information [3], and they can be grouped into two categories: content disclosure and identity disclosure. Content disclosure compromises sensitive label information, such as the salary, social security number, or medical history of a user. To guard against content disclosure, three classic privacy-preserving techniques have been proposed, namely k-anonymity [8], \(\ell \)-diversity [9], and t-closeness [10]. They generalize several labels into a single one (also known as an equivalence class) to hide the sensitive information of a single label, so that the attacker can only see the generalized label. In particular, k-anonymity requires that each equivalence class contain at least k records so that each record is indistinguishable from at least \(k-1\) other records. \(\ell \)-diversity requires that the distribution of a sensitive attribute in each equivalence class have at least \(\ell \) "well-represented" values. t-closeness requires that the label distribution in each equivalence class be no more than t distance away from that in the whole set of labels. It is known that t-closeness can resist more attacks (such as the similarity attack) than k-anonymity and \(\ell \)-diversity and is the most stringent privacy metric among the three (Footnote 3). However, when it comes to protecting labels in attributed graphs, while [3] and [12] adopt k-anonymity and \(\ell \)-diversity, respectively, no existing work adopts t-closeness.

Identity disclosure [13, 14] compromises the location of a target node in the graph even after the node's identity has been removed. This disclosure can be caused by various structural attacks [13, 15, 16] such as the degree attack, 1-neighbor-graph attack, subgraph attack and hub-fingerprint attack. To guard against these attacks, many structure-privacy-preserving techniques [17,18,19] have been developed to enforce symmetry in an outsourced graph. In particular, [17] proposes the k-automorphism model. Given a graph G, it transforms G into a k-automorphic graph \(G^k\) by introducing noise edges and vertices so that each vertex has at least \((k - 1)\) other symmetric vertices. Hence, there are no structural differences between any vertex and its \((k - 1)\) symmetric vertices; in other words, the attacker cannot distinguish a vertex from its \((k - 1)\) symmetric vertices. It is known that the k-automorphism strategy can defend against any structure-based attack [17].

Example 1

Consider the graph G in Fig. 1a, where each vertex represents an entity. There are three entity types: individual (p), school (s) and company (c). The edges in G represent the relations between two entities, such as the "Work at," "Work together" and "Graduate from" relations. If an adversary knows that one person's node has degree 1 in the graph G, he can immediately identify node \(p_4\) as that person, and the related attributes of node \(p_4\) are revealed. The k-automorphism model can be applied to prevent such a structural attack by introducing noise edges to construct a k-automorphic graph \(G^k\). For example, \(G^k\) in Fig. 2a is a k-automorphic graph of G with \(k=2\); the noise edges are shown as black dashed lines. Note that each node contains a set of attributes, some of which (e.g., salary) are sensitive. In this case, a privacy model for structural privacy alone (e.g., k-automorphism) is not sufficient to protect the label privacy of each node. For example, even for the k-automorphic graph \(G^k\), if the labels are not anonymized, the adversary can use prior or background knowledge (e.g., the "Occupation" attribute in G, Fig. 1a) to identify the location of an individual. Suppose the adversary can narrow an individual down to node \(p_1\) or \(p_4\) of G in Fig. 1a; then he can conclude that this person's salary is high ("14,000" or "15,000") without exactly re-identifying the node. Therefore, when sensitive labels are considered, t-closeness should be adopted for graphs to generalize each label into a generalized one, as shown in Fig. 2a.

The example above shows the necessity of preserving both structural privacy and label privacy. However, apart from [3, 20], which support only k-anonymity, all existing works on attributed graphs consider either structural or label privacy. In this paper, we propose (k, t)-privacy, an integrated privacy model for outsourced graphs. Any graph that satisfies (k, t)-privacy must satisfy both k-automorphism for its structure and t-closeness for each generalized label of all node attributes. Intuitively, the (k, t)-privacy model preserves both structural privacy and label privacy.

Once the graph G is anonymized, a straightforward method is to upload the k-automorphic graph \(G^k\) to the cloud and perform subgraph queries over \(G^k\). However, this has some major limitations. First, many noise edges and vertices are introduced into the original graph G when constructing the k-automorphic graph \(G^k\). This results in a much larger graph on the cloud side, which leads to much higher storage and query costs. Although Chang et al. [3] propose to store only a succinct version of the k-automorphic graph, query decomposition is adopted during subgraph matching, which causes an extra result-joining step and incurs a large number of false positive results. Second, when imposing t-closeness on the labels of a graph G, different outputs of label generalization affect not only the privacy strength but also the search space (i.e., the number of vertices or matchings to explore for a query; see Section 4). Intuitively, a larger search space causes higher query cost. The second drawback is illustrated with the following example.

Example 2

Consider the attribute "Expected Salary" of the graph G in Fig. 1a, whose original set of labels is \(l=(7000\), 8000, 14000, 15000). There are two possible label generalizations. The first generalizes (7000, 8000) into a generalized label "A" and (14000, 15000) into another generalized label "B." In this case, given the query Q in Fig. 1b and its anonymized form, there are two matchings, \((c_1, p_1, p_2, s_1)\) and \((c_2, p_3, p_4, s_2)\). The second generalizes (7000, 15000) into "A" and (8000, 14000) into "B." In this case, there is only one matching, \((c_1, p_1, p_2, s_1)\). Obviously, the first generalization leads to a false positive matching (i.e., \((c_2, p_3, p_4, s_2)\)).

To address the subgraph matching efficiency problem caused by the enlarged graph size under k-automorphism and the enlarged search space under t-closeness-based label generalization, we model the cost of subgraph matching and reduce the problem of optimal label generalization for t-closeness to the General Set Partitioning Problem [21]. Based on this, we propose an efficient approximation algorithm TOGGLE with a bounded approximation ratio of \((1+\epsilon )\). Furthermore, as the cloud only stores a succinct version of the k-automorphic graph, query decomposition is adopted during subgraph matching [3], which causes an extra result-joining step and incurs a large number of false positive results. We therefore design a new subgraph matching algorithm PGP that works directly on such graphs without decomposition. To summarize, our main contributions are as follows:

(i) We propose (k, t)-privacy for outsourced graphs, which is, to the best of our knowledge, the most stringent generalization-based privacy model covering both graph structure and node labels.

(ii) We propose a t-closeness label generalization algorithm TOGGLE that optimizes the cost of subgraph matching. It is proved to have an approximation ratio of \((1+\epsilon )\).

(iii) We develop a partial-graph-based subgraph processing algorithm PGP that requires no query decomposition. It exploits the symmetry of the outsourced graph and limits the search scope to a localized region.

(iv) We conduct an empirical study of our algorithms on all three datasets previously used in the literature and show their high efficiency under various parameter settings.

The remainder of this paper is organized as follows. Section 2 presents the background and problem statement. Section 3 overviews the graph outsourcing and subgraph matching framework with a baseline solution adapted from [3]. In Section 4, we present the label generalization algorithm TOGGLE, and in Section 5, we introduce the PGP subgraph matching algorithm. Experiments and related works are presented in Sections 6 and 7, respectively, followed by a conclusion in Section 8. Formal proofs of lemmas are in Appendix A.

Table 1 List of key notations
Fig. 2 k-automorphic data graph \(G^{k}\), outsourced graph \({\widetilde{G}}^k\) and Label Correspondence Table (LCT)

Fig. 3 Alignment Vertex Table and outsourced query \({\widetilde{Q}}\)

2 Background and problem statement

In this section, we first introduce the background of the two privacy models in (k, t)-privacy, namely t-closeness and k-automorphism. We then present privacy attacks on graphs and summarize them as the threat model of this paper. Finally, we formally define privacy-preserving subgraph matching with (k, t)-privacy. Table 1 lists the key notations and acronyms used in this paper.

2.1 t-Closeness

Generalization, which combines several values into a single one (also known as an equivalence class), has been a popular anonymization technique for labels [3, 22, 23]. The output of a label generalization algorithm is a label correspondence table that records the mapping from each original label to its generalized label. t-closeness [10] is a privacy metric that imposes a constraint on each equivalence class.

Definition 1

(t-closeness) An equivalence class satisfies t-closeness if and only if the label distribution in this class is no more than t distance away from that in the whole set of labels. If all equivalence classes satisfy t-closeness, the label generalization algorithm satisfies t-closeness.

The distance is measured by the Earth Mover's Distance (EMD) [24], which is based on the minimum workload required to transform one distribution into another. Specifically, each distribution is viewed as a mass of earth, and in each step of the transformation some earth is moved from one place to another. The moved mass multiplied by the ground distance of this move is the workload of this step, which is added to the total workload. To find the minimum workload, the EMD computation relies on the well-known combinatorial optimization problem of balanced transportation. Formally, let \({\mathcal {P}}= ({p}(v_1),{p}(v_2),\ldots ,{p}(v_n))\) and \({\mathcal {P}}'= ({p}(v'_{1}),{p}(v'_{2}),\ldots ,{p}(v'_{m}))\) denote the two distributions and \(d_{i,j}\) be the ground distance between \({v}_i\) and \(v'_j\) (Footnote 4); the balanced transportation problem finds a mass flow \(F=\{f_{i,j}\}, i\in [1,n], j\in [1,m]\) such that \(\sum _{i=1}^{n}\sum _{j=1}^{m} d_{i,j} \times f_{i,j}\) is minimized, subject to \( f_{i,j} \ge 0 \) and \( \sum _{i=1}^{n}\sum _{j=1}^{m} f_{i,j} = \sum _{i=1}^{n} p(v_i) = \sum _{j=1}^{m} p(v'_{j}) =1. \) Solving for the mass flow F, we obtain the Earth Mover's Distance between distributions \({\mathcal {P}}\) and \({\mathcal {P}}'\) as \( EMD({\mathcal {P}},{\mathcal {P}}{'}) = \sum _{i=1}^{n}\sum _{j=1}^{m} d_{i,j} \times f_{i,j}. \)

Example 3

Figure 2b shows a label correspondence table. For the attribute "Expected Salary" of graph G in Fig. 1a, the original set of labels is \(l=(7000\), 8000, 14000, 15000). There are two possible label generalizations. The first combines (7000, 8000) as a label group "A" and (14000, 15000) as another group "B." In this case, \(EMD(l,A)\) \(=(0/3+1/3+1/3+2/3)*1/4= 1/3\), where 0/3 is the ground distance \(d_{0,0}\) and 1/4 is the mass \(f_{0,0}\). Similarly, \(EMD(l,B)=1/3\). As such, this method satisfies 1/3-closeness. The second generalization, as illustrated in Fig. 2b, combines (7000, 15000) as "A" and (8000, 14000) as "B." In this case, \(EMD(l,A)\) and \(EMD(l,B)\) are both 1/6. Therefore, this method satisfies 1/6-closeness, which preserves more privacy than the first generalization.
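To make this computation concrete, the following sketch (ours, in Python, not part of the original paper) reproduces the EMD values of Example 3 under the ordered ground distance \(|i-j|/(n-1)\). For one-dimensional distributions over ordered labels, the balanced transportation problem reduces to accumulating the difference of the two cumulative distributions, so no LP solver is needed; the function name emd_ordered and the list-based representation are illustrative.

```python
def emd_ordered(all_labels, group):
    """EMD between the uniform distribution over all n ordered labels and the
    uniform distribution over one label group, with ground distance
    |i - j| / (n - 1) between label positions i and j."""
    n = len(all_labels)
    p = [1.0 / n] * n                                       # mass of each label in l
    q = [1.0 / len(group) if v in group else 0.0 for v in all_labels]
    emd, carry = 0.0, 0.0
    for i in range(n - 1):
        carry += p[i] - q[i]        # earth that must cross the gap between i and i+1
        emd += abs(carry) / (n - 1)
    return emd

labels = [7000, 8000, 14000, 15000]
print(emd_ordered(labels, [7000, 8000]))    # 0.333... = 1/3 (first generalization)
print(emd_ordered(labels, [7000, 15000]))   # 0.166... = 1/6 (second generalization)
```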

2.2 k-Automorphism

k-automorphism is a graph privacy model that can defend against existing structural attacks [17], including the degree attack, 1-neighbor-graph attack, subgraph attack and hub-fingerprint attack [13, 15, 16]. The idea is to construct \((k-1)\) symmetric blocks for each block (i.e., a subset of vertices and their corresponding edges) in a graph, so that a vertex cannot be distinguished from the other \((k-1)\) vertices in those symmetric blocks. The resulting graph is a k-automorphic graph. Converting a graph G into a k-automorphic graph \(G^{k}\) involves three steps—graph partition, graph (block) alignment and edge copy. First, the vertices in G are partitioned into k blocks by the METIS algorithm [25]. Second, graph (block) alignment selects and aligns the vertices with the largest degree in each block and then aligns all other vertices in the same block with those in the other blocks by their breadth-first search (BFS) traversal order. The result is an alignment vertex table (AVT), where each column corresponds to one block and each row denotes the mapping of k vertices. The same mapping can also be recorded by an automorphic function. Third and finally, based on the AVT, symmetric edges are inserted into the other \((k-1)\) blocks for each edge in one block. Crossing edges between two blocks are also copied accordingly.

Example 4

Graph G in Fig. 1a is first partitioned into \(k=2\) blocks, \(B_1\)=\(\{p_1,p_2,s_1,c_1\}\) and \(B_2\)=\(\{p_4,p_3,s_2,c_2\}\) (see Fig. 2a). By graph alignment, each vertex in one block is aligned with a vertex in the other. For example, \(p_1\) is aligned with \(p_4\), so \((p_1,p_4)\) is inserted into the alignment vertex table in Fig. 3a. Equivalently, their mapping is recorded by a k-dimensional automorphic function \(F_1\). After all vertices are aligned, the missing symmetric edges between \(p_3\) and \(p_4\) and between \(p_3\) and \(s_2\) (the dashed lines in Fig. 2a) are inserted. Finally, the crossing edge between \(s_1\) and \(p_3\) is copied between \(p_2\) and \(s_2\). The resulting k-automorphic graph \(G^{k}\) is shown in Fig. 2a.
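To illustrate the edge-copy step of Example 4, the sketch below (ours; the function names and the row-based AVT encoding are assumptions for exposition) applies the automorphic functions derived from the AVT to every edge. As input we use the crossing edge \((s_1,p_3)\) named in Example 4 plus the two \(B_1\) edges whose symmetric copies Example 4 inserts as noise edges.

```python
def build_automorphisms(avt_rows, k):
    """avt_rows: AVT rows, each a tuple of the k aligned vertices (one per block).
    Returns the automorphic functions F_1..F_{k-1}; F_a maps a vertex in block b
    to its aligned vertex in block (b + a) mod k."""
    col = {v: i for row in avt_rows for i, v in enumerate(row)}
    row_of = {v: row for row in avt_rows for v in row}
    return [{v: row_of[v][(col[v] + a) % k] for v in col} for a in range(1, k)]

def copy_symmetric_edges(edges, avt_rows, k):
    """Edge-copy step: for every edge, insert its image under every automorphic
    function; crossing edges between blocks are copied in the same way."""
    out = set(edges)
    for F in build_automorphisms(avt_rows, k):
        for (u, v) in edges:
            out.add(tuple(sorted((F[u], F[v]))))
    return out

# AVT of Example 4 (k = 2), one row per aligned pair of vertices.
avt_rows = [("p1", "p4"), ("p2", "p3"), ("s1", "s2"), ("c1", "c2")]
# The crossing edge (s1, p3) from Example 4, plus the two B1 edges whose
# symmetric copies Example 4 inserts as noise edges.
edges = {("p1", "p2"), ("p2", "s1"), ("p3", "s1")}
print(sorted(copy_symmetric_edges(edges, avt_rows, 2) - edges))
# -> [('p2', 's2'), ('p3', 'p4'), ('p3', 's2')]
```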

2.3 Graph privacy attacks and threat model

In this paper, we study attributed graphs, i.e., undirected graphs with attributes on each node [3].

Definition 2

(Attributed Graph) An attributed graph is defined as \(G=(V,E,T,\Gamma ,L)\) where V is a vertex set; E is an edge set; and \(T,\Gamma ,L\) denote the sets of vertex types, attributes and labels, respectively. Each vertex has a unique type and one or more attributes (depending on the type). Each attribute takes its value from a set of labels.

Example 5

The data graph G in Fig. 1a is an attributed graph with three vertex types, namely “Individual,” “University” and “Company.” “Individual” contains two attributes, namely “Occupation” and “Expected Salary.” The label value of “Expected Salary” of vertex \(p_1\) is “15,000.”

In the literature, privacy attacks on outsourced graphs lead to identity disclosure (structural attacks) and content disclosure (label attacks). Structural attacks include degree attack, 1-neighbor-graph attack, subgraph attack and hub-fingerprint attack [13, 15, 16]. Label attacks include background-knowledge attack, homogeneity attack, skewness attack and similarity attack [9, 10]. A common approach in these attacks is to first identify the target node and then unveil its sensitive labels. The following definition abstracts the capability of all known attacks as above and serves as the threat model of this paper.

Definition 3

(Threat model of graph privacy attacks) An attacker has the complete structure information about the target node, including degree, neighbor list and shortest-distances from known nodes (a.k.a., hubs). She also has the complete label information of all nodes except for the target node. Based on such information, she attempts to match the target node in the outsourced graph and then to unveil its label.

2.4 Problem statement

Our problem of privacy-preserving subgraph matching with (k, t)-privacy involves two subproblems.

Definition 4

(Graph Outsourcing Problem with (k, t)-Privacy) Given a data graph G, compute an outsourced graph for the cloud that satisfies the following two privacy metrics:

(i) Preserving label privacy by enforcing t-closeness for its labels;

(ii) Preserving structure privacy by enforcing k-automorphism for its structure.

Privacy guarantee of \(\mathbf {(k, t)}\)-privacy For any outsourced graph that satisfies (k, t)-privacy, the probability that an attacker correctly re-identifies a target node using any structural information is at most 1/k, because the k-automorphism model ensures that each node is structurally indistinguishable from at least \(k-1\) other nodes. Even in the event (of probability at most 1/k) that the attacker re-identifies the target node, she can only learn limited information about the node's true label, as this label has been generalized into a label group whose distribution of underlying labels is at most t distance away from the distribution of all labels in the graph.

Definition 5

(Subgraph matching problem on outsourced graph) Given a data graph G and its corresponding outsourced graph, for a query Q, to retrieve all subgraphs \(\{g_i\}\) of G, each of which is subgraph isomorphic to Q and vice versa. \(Q=(V_1,E_1,T_1, \Gamma _1,\) \(L_1)\) is subgraph isomorphic to \(g_i=(V_2,E_2,T_2, \Gamma _2,L_2)\) if and only if there exists at least one injection function \(f: Q\rightarrow g_i\) such that

(i) \(\forall u \in V_1, \exists f(u) \in V_2\) such that \(T_1(u) = T_2(f(u)) \) and \(L_1(u) \subseteq L_2(f(u))\);

(ii) \(\forall (u,v) \in E_1, (f(u),f(v))\in E_2\).
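To make the two conditions concrete, the following sketch (ours, assuming a simple dictionary representation of typed, labeled graphs) verifies whether a given candidate mapping f satisfies Definition 5; it is a checking helper, not a matching algorithm.

```python
def satisfies_embedding(query, data, f):
    """Check conditions (i) and (ii) of Definition 5 for a candidate mapping f
    from query vertices to data vertices.  query/data are dicts with per-vertex
    'type' and 'labels' (a set), and an 'edges' set of vertex pairs."""
    if len(set(f.values())) != len(f):                        # f must be injective
        return False
    for u, v in f.items():                                    # condition (i)
        if query["type"][u] != data["type"][v] or not query["labels"][u] <= data["labels"][v]:
            return False
    for (u1, u2) in query["edges"]:                           # condition (ii)
        if (f[u1], f[u2]) not in data["edges"] and (f[u2], f[u1]) not in data["edges"]:
            return False
    return True
```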

Example 6

Figure 1b presents a query Q. A UIUC spin-off company wants to recruit two employees through a professional social network. The search criteria are as follows: (1) they both graduated from UIUC, (2) they work for the same Internet company, and (3) one is an engineer whose expected salary is 15,000/mo, and the other is an art designer whose expected salary is 7,000/mo.

Theorem 1

Both the graph outsourcing problem with (k, t)-privacy and the subgraph matching problem on outsourced graphs are NP-hard.

3 Solution overview

In this section, we overview our solution to graph outsourcing and subgraph matching on outsourced graphs. The workflow of our solution is depicted in Fig. 4. Given a data graph G, the client user anonymizes it to satisfy (k, t)-privacy (step \(\textcircled {\small {1}}\)). The result consists of a k-automorphic graph \(G^{k}\) and a label correspondence table (LCT). Since each vertex in \(G^{k}\) has \((k-1)\) symmetric vertices in the other \((k-1)\) blocks, the client user only needs to upload a succinct version \({\widetilde{G}}^k\) of the k-automorphic graph to the cloud (step \(\textcircled {\small {2}}\)), which is also called an outsourced graph [3]. \({\widetilde{G}}^k\) comprises the vertices and edges in one block together with their 1-hop neighboring vertices and edges. For example, the subgraph inside the red dashed circle in Fig. 2a is the succinct \({\widetilde{G}}^k\).

Upon receiving a matching query \({\widetilde{Q}}\) (step \(\textcircled {\small {4}}\)) transformed from the client's original query Q (e.g., Fig. 3b from Fig. 1b), the cloud evaluates it based on \({\widetilde{G}}^k\) and the LCT. The results are sent back to the client (see step \(\textcircled {\small {5}}\)) for further refinement to filter out false positive results. In the rest of this paper, we focus on the two key components highlighted in Fig. 4, namely graph outsourcing with \(\varvec{(k, t)}\)-privacy and subgraph matching.
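For concreteness, extracting the succinct outsourced graph from \(G^k\) amounts to keeping one block plus its 1-hop neighborhood; the short sketch below is ours, with illustrative names, and is not the authors' implementation.

```python
def extract_outsourced_graph(edges, block):
    """Succinct outsourced graph: the vertices and edges of one block of G^k
    together with their 1-hop neighboring vertices and edges."""
    block = set(block)
    kept_edges = {(u, v) for (u, v) in edges if u in block or v in block}
    kept_vertices = block | {x for e in kept_edges for x in e}
    return kept_vertices, kept_edges

# e.g., keep block B1 = {p1, p2, s1, c1} of the k-automorphic graph in Fig. 2a:
# extract_outsourced_graph(edges_of_Gk, ["p1", "p2", "s1", "c1"])
```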

Fig. 4 Workflow of subgraph matching on outsourced graph

In the rest of this section, we propose a baseline solution by adapting existing work [3] to generalize labels into label groups. Then, we discuss its limitations, based on which we propose two new algorithms for label generalization and subgraph matching in Sections 4 and 5, respectively.

3.1 Baseline solution

3.1.1 Graph outsourcing with (k, t)-privacy

The idea is to handle k-automorphism and t-closeness separately. We first transform the original graph G into a k-automorphic graph \(G^{k}\) and then adapt the existing label generalization algorithm in [3] to satisfy the t-closeness metric. Given a vertex type that contains n vertex labels \(l=(l_{1}, l_{2},\ldots , l_{n})\), we propose Algorithm 1 to generalize the n labels into m groups. It first permutes the set of labels into \(P=(l_{p_1}, l_{p_2},\ldots , l_{p_{n}})\) and then sequentially divides P into m groups, each of which contains n/m labels and satisfies t-closeness.

Algorithm 1 Baseline label generalization (pseudocode)

In particular, Algorithm 1 initializes the global minScore and the permutation \(\{y_j\}\) (Lines 2 and 3) where \(y_j\) is a label group and \( j \in \{1,2,\ldots ,m\}\). Then, the function groupEnum is called recursively to find the optimal permutation by gradually expanding the candidate permutation P (Line 4).

The groupEnum function first checks whether the current P is a complete permutation of size n (Line 6). If so, its cost (e.g., an estimated search space) is calculated by getCost(P) (Line 7). After that, the best result \(\{y_j\}\) is updated to P if the cost is lower than the current lowest (Lines 8 and 9). Otherwise, the function enumerates subsets of n/m labels from l as candidate label groups and records them in the set \(P_s\) (Line 11). For each label group \(y'_j\) in this set, a checkTclo routine calculates the EMD by solving a transportation problem [24] and then checks whether \(y'_j\) satisfies t-closeness (Line 13). If so, \(y'_j\) is appended to the candidate permutation P and removed from the labels l (Line 14).
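A minimal sketch of this enumeration is given below (ours, not the authors' code). It assumes two helper callables, get_cost for the estimated search space of a complete partition and check_tclo for the EMD-based t-closeness test, mirroring getCost and checkTclo in Algorithm 1.

```python
from itertools import combinations

def baseline_generalize(labels, m, get_cost, check_tclo):
    """Baseline enumeration (Algorithm 1): try every way of carving the labels
    into m groups of size n/m, keep only groups passing the t-closeness check,
    and return the partition with the minimum estimated cost."""
    n = len(labels)
    best = {"cost": float("inf"), "groups": None}

    def group_enum(remaining, partition):
        if not remaining:                                   # complete permutation (Line 6)
            cost = get_cost(partition)                      # getCost (Line 7)
            if cost < best["cost"]:                         # Lines 8-9
                best["cost"], best["groups"] = cost, list(partition)
            return
        for group in combinations(sorted(remaining), n // m):    # Line 11
            if check_tclo(labels, group):                   # checkTclo via EMD (Line 13)
                group_enum(remaining - set(group), partition + [list(group)])  # Line 14

    group_enum(set(labels), [])
    return best["groups"], best["cost"]
```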

Complexity Analysis. The space complexity of Algorithm 1 is O(n). As for the time complexity, let \(t_1\) denote the time complexity of checkTclo, which is invoked at most \(\frac{n!}{((n/m)!)^{m}}\) times. As such, the time complexity is \(O(\frac{n!t_1}{((n/m)!)^{m}})\), which is exponential in n.

3.1.2 Subgraph matching

As described in [3], for a subgraph matching query Q at the client, the client first generalizes its vertex labels by the label correspondence table and sends the generalized query \({\widetilde{Q}}\) to the cloud. Since the cloud can only access the succinct graph \({\widetilde{G}}^k\), which consists of one block of \(G^{k}\) together with its 1-hop neighbors, it uses a special star-based subgraph matching algorithm. On the cloud side, the algorithm consists of three steps.

(i) Query decomposition. The cloud first decomposes the query \({\widetilde{Q}}\) (e.g., Fig. 3b) into a set of star shapes \(\{S_i\}\) (see Fig. 3b), each of which has a root vertex together with its adjacent edges and neighboring vertices.

(ii) Star matching. The cloud then retrieves the matchings of each decomposed star \(S_i\) in the succinct graph \({\widetilde{G}}^k\), denoted by \(R(S_i,{\widetilde{G}}^k)\), and leverages the symmetry of \(G^k\) to retrieve the matchings of \(S_i\) in \(G^k\), denoted by \(R(S_i,G^{k})\).

(iii) Star joining. The cloud starts with one \(R(S_i,G^{k})\) and iteratively computes its natural join with \(R(S_j,G^{k})\) until all \(j \ne i\) are joined. The results \(R({\widetilde{Q}},G^{k})\) are the matchings of \({\widetilde{Q}}\) over \(G^{k}\) (a sketch of this natural join is given below).

On the client side, the algorithm filters false positives in \(R({\widetilde{Q}},G^{k})\) using the original graph G and query Q, and obtains the final subgraph matchings R(QG).
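For reference, the star-joining step (iii) above can be sketched as a plain nested-loop natural join. The sketch below is ours and only illustrates the join logic whose cost, quantified in Example 7, motivates PGP.

```python
def join_star_matchings(star_results):
    """Star joining: iteratively natural-join the matching tables of the
    decomposed stars on their shared query vertices.  Each table is a list of
    dicts mapping the query vertices of one star to data vertices."""
    joined = star_results[0]                 # at least one star is assumed
    for table in star_results[1:]:
        merged = []
        for r1 in joined:
            for r2 in table:
                shared = set(r1) & set(r2)
                if all(r1[u] == r2[u] for u in shared):   # consistent on shared vertices
                    merged.append({**r1, **r2})
        joined = merged
    return joined
```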

3.2 Limitations

Though the baseline solution supports privacy-preserving subgraph matching with (k, t)-privacy, it has two main drawbacks. First, the time complexity of label generalization is exponential in the number of labels, as all permutations need to be considered. Second, the star-based subgraph matching is also inefficient because (1) due to query decomposition, it cannot narrow down the search scope of the query to a localized region of the graph, and (2) the natural join is computationally intensive. We illustrate the second drawback with an example.

Example 7

Figure 3b shows a generalized query graph \({\widetilde{Q}}\). The baseline algorithm decomposes \({\widetilde{Q}}\) into two stars, \(S_1\) and \(S_2\). Three matchings are found for \(S_1\) on \({\widetilde{G}}^k\), namely, \((p_1,c_1,s_1,p_2)\), \((p_2,c_1,s_1,p_1)\) and \((p_2,c_1\), \(s_2,p_1)\). Similarly, three matchings are found for \(S_2\) on \({\widetilde{G}}^k\), namely \((c_1,s_1,p_1)\) , \((c_1,s_1,p_2)\) and \((c_1, s_2,p_2)\). As such, the natural join needs to join all six matchings for both stars. Since each matching of \(S_1\) should join with all matchings of \(S_2\), nine join operations are needed.

To overcome the first drawback, in Section 4, we propose the TOGGLE algorithm (T-closeness-Optimized Graph Generalization on Label Extension) for label generalization. It significantly reduces the search space for (sub)-optimal label generalization. For the second drawback, we present the PGP algorithm (Partial-Graph-based subgraph Processing) in Section 5. It exploits the symmetry of the outsourced graph and eliminates the need for query decomposition.

4 TOGGLE for label generalization

As discussed in Section 1, different outputs of label generalization have different impacts on the search space, and a larger search space leads to higher query cost. Hence, TOGGLE aims to generalize labels into groups so as to minimize the search space, which refers to the total number of vertices to explore for a query. To this end, we first study how to estimate the search space of subgraph matching (Section 4.1). We then introduce the optimal TOGGLE for label generalization, which minimizes the search space by formulating the minimization as a combinatorial optimization problem with constraints (i.e., the TOGGLE problem, Section 4.2). We further present an approximate TOGGLE algorithm with a bounded error in Section 4.3.

4.1 Estimating search space for subgraph matching

To estimate the search space for a query \({\widetilde{Q}}\) over an outsourced graph \({\widetilde{G}}^k\), we assume a general expansion-based graph search [3, 26] as follows. The first vertex q of \({\widetilde{Q}}\) is selected based on degree and neighborhood signature. q is then matched on \({\widetilde{G}}^k\) with any vertex that contains the same vertex type and label group as q. After that, other vertices of \({\widetilde{Q}}\) are matched with neighbors of those already matched in \({\widetilde{G}}^k\). For the first vertex q, the number of vertices to explore can be estimated by the number of vertices in \({\widetilde{G}}^k\) multiplied by the probability of these vertices being the same types and having the same labels as q. For other vertices of \({\widetilde{Q}}\), the number of vertices to explore in \({\widetilde{G}}^k\) is limited to neighbors of those already matched in \({\widetilde{G}}^k\). In the end, the search space of \({\widetilde{Q}}\) over \({\widetilde{G}}^k\) is the product of all these numbers. Formally, for the \(\tau \)-th vertex type, let \(n_\tau \), \(m_\tau \) and \(p_{r_\tau (j-1)+i}\) denote the number of labels, the number of label groups and the i-th position in the j-th label group, respectively. \(r_\tau =n_\tau /m_\tau \). We also use \(F_{G}(\tau )\) to denote the probability of vertices in a graph G (e.g., \(G^{k}\), \({\widetilde{Q}}\)) being the type \(\tau \). Similarly, \(F_{G}^{l}(\tau ,i)\) and \(F_{G}^{g}(\tau ,j)\) denote the probability of the i-th label and the j-th label group (after the label generalization) being this type, respectively.

For the first vertex q, the number of its matchings can be estimated by the number of vertices in \({\widetilde{G}}^k\) (approximately \(\frac{|V(G^{k})|}{k}\)) multiplied by the matching probability of q over \({\widetilde{G}}^k\). Formally,

$$\begin{aligned} \frac{|V(G^{k})|}{k}\sum _{\tau =1}^{\mathbb {T}} \Bigg [F_{G^{k}}(\tau ) F_{{\widetilde{Q}}}(\tau ) \sum _{j=1}^{m_\tau } F_{G^{k}}^{g}(\tau ,j) F_{{\widetilde{Q}}}^{g}(\tau ,j) \Bigg ] \end{aligned}$$
(1)

where \(\mathbb {T}\) is the number of vertex types.

For the other vertices, their matches are restricted to neighbors of those already matched in \({\widetilde{G}}^k\). Let \(|{\widetilde{Q}}|\) denote the number of vertices in \({\widetilde{Q}}\), and let the average degree \(D(G^{k})\) approximate the number of neighbors of a vertex in \(G^k\); the number of matches for the other \(|{\widetilde{Q}}|-1\) vertices can then be estimated as follows.

$$\begin{aligned} \Bigg \lbrace D(G^{k}) \sum _{\tau =1}^{\mathbb {T}} \Bigg [F_{G^{k}}(\tau ) F_{{\widetilde{Q}}}(\tau ) \sum _{j=1}^{m_\tau } F_{G^{k}}^{g}(\tau ,j) F_{{\widetilde{Q}}}^{g}(\tau ,j) \Bigg ]\Bigg \rbrace ^{|{\widetilde{Q}}|-1} \end{aligned}$$
(2)

Therefore, the search space of subgraph query \({\widetilde{Q}}\) is the product of Expressions (1) and (2), which directly correlates with \(F_{G^{k}}(\tau ) F_{{\widetilde{Q}}}(\tau )\) \(\sum _{j=1}^{m_\tau } F_{G^{k}}^{g}(\tau ,j) F_{{\widetilde{Q}}}^{g}(\tau ,j)\). Notably, each vertex in \(G^k\) has the union of label groups of all its \((k-1)\) symmetric vertices, which implies that \(F_{G^{k}}^{g}(\tau ,j)\) will increase by a factor of no more than \(k-1\). In other words, assuming that the j-th label group of the \(\tau \)-th vertex type contains \(r_\tau \) labels, \(\{l_{\tau ,p_{r_\tau (j-1)+i}} | i\in [1,r_\tau ]\}\), where \(l_{\tau ,p_{r_\tau (j-1)+i}}\) is the label locating at position i of this label group, \(F_{G^{k}}^{g}(\tau ,j) \le k\sum _{i=1}^{r_\tau }F_{G}^l(\tau , p_{r_\tau (j-1)+i})\) is therefore obtained. Additionally, \(F_{{\widetilde{Q}}}^{g}(\tau ,j) = \sum _{i=1}^{r_\tau }F_{Q}^l(\tau , p_{r_\tau (j-1)+i})\) can be easily derived. \(F_{G^{k}}(\tau ) F_{{\widetilde{Q}}}(\tau )\) \(\sum _{j=1}^{m_\tau } F_{G^{k}}^{g}(\tau ,j) F_{{\widetilde{Q}}}^{g}(\tau ,j)\) is therefore bounded by the product of \(kF_{G^{k}}(\tau ) F_{{\widetilde{Q}}}(\tau )\) and

$$\begin{aligned} \sum _{j=1}^{m_\tau } \Bigg [\sum _{i=1}^{r_\tau }F_{G}^l(\tau , p_{r_\tau (j-1)+i}) \Bigg ]\Bigg [\sum _{i=1}^{r_\tau }F_{Q}^l(\tau , p_{r_\tau (j-1)+i}) \Bigg ]. \end{aligned}$$
(3)

Given a vertex type \(\tau \), \(kF_{G^{k}}(\tau ) F_{{\widetilde{Q}}}(\tau )\) is a constant.
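To make the estimation concrete, the sketch below (ours; the frequency tables are assumed to be precomputed statistics) evaluates the product of Expressions (1) and (2) for a query of q_size vertices, given the per-type and per-label-group frequencies of \(G^{k}\) and \({\widetilde{Q}}\).

```python
def estimate_search_space(num_vertices_gk, avg_degree_gk, k, q_size,
                          type_freq_gk, type_freq_q, group_freq_gk, group_freq_q):
    """Estimated search space = Expression (1) * Expression (2).
    type_freq_*[tau]     ~ F(tau) for G^k and Q~
    group_freq_*[tau][j] ~ F^g(tau, j) for G^k and Q~."""
    match_prob = sum(
        type_freq_gk[tau] * type_freq_q[tau]
        * sum(g * q for g, q in zip(group_freq_gk[tau], group_freq_q[tau]))
        for tau in range(len(type_freq_gk))
    )
    first_vertex = (num_vertices_gk / k) * match_prob                # Expression (1)
    other_vertices = (avg_degree_gk * match_prob) ** (q_size - 1)    # Expression (2)
    return first_vertex * other_vertices
```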

4.2 Optimal TOGGLE for label generalization

Given the vertex labels \(l= (l_{\tau ,1}, l_{\tau ,2},\ldots , l_{\tau ,n_\tau })\) for the \(\tau \)-th vertex type, label generalization combines them into \(m_\tau \) groups, each with \(r_\tau \) labels. It is equivalent to finding a permutation \(P=(l_{\tau ,p_1}, l_{\tau ,p_2}\), \(\ldots \), \(l_{\tau ,p_{n_\tau }})\) of l and then sequentially dividing the permuted labels into \(m_\tau \) groups, each satisfying t-closeness. Formally,

$$ EMD_j \Big (l, \{l_{\tau ,p_{r_\tau (j-1)+i}} { | i\in [1,r_\tau ]}\} \Big ) \le t, \forall j\in [1,m_\tau ]. $$

Meanwhile, we want TOGGLE to minimize Expression (3) so that the label generalization leads to a minimum search space. Combining the above, TOGGLE can be formulated as a combinatorial optimization problem with constraints.

$$\begin{aligned} {{{\underline{\varvec{TOGGLE}}}}}~~~~~~~~\mathop {{{\,\mathrm{argmin}\,}}}_{P} \sum _{j=1}^{m_\tau } \Bigg [\sum _{i=1}^{r_\tau }F_{G}^l(\tau , p_{r_\tau (j-1)+i}) \Bigg ]^2 \end{aligned}$$

subject to

$$ EMD_j \Big (l, \{l_{\tau ,p_{r_\tau (j-1)+i}} { | i\in [1,r_\tau ]}\} \Big ) \le t, j\in [1,m_\tau ]. $$

Note that the objective function above is a simplified version of Expression (3) obtained by assuming that query graphs and data graphs are independent and identically distributed. The vertex type \(\tau \) is omitted whenever it is fixed.
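To tie the formulation back to code, the following sketch (ours) evaluates a candidate partition against the simplified objective and the t-closeness constraints; label_freq plays the role of \(F_{G}^{l}\), and emd can be any EMD routine such as the one sketched in Section 2.1.

```python
def toggle_objective(partition, label_freq):
    """Simplified TOGGLE objective: sum over label groups of the squared sum of
    in-graph label frequencies within the group."""
    return sum(sum(label_freq[l] for l in group) ** 2 for group in partition)

def is_feasible(partition, labels, t, emd):
    """t-closeness constraints: every label group must be within EMD t of the
    overall label distribution."""
    return all(emd(labels, group) <= t for group in partition)
```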

This optimization problem is challenging as there are a huge number of permutations, and for each permutation the Earth Mover's Distance of all \(m_\tau \) label groups needs to be calculated. To address this, we reduce our optimization problem to the General Set Partitioning Problem (GSPP) [21]. Given a universe of n elements (\(1,2,\ldots ,n\)), there is a rule \(\mathbb {Y}\) that generates feasible subsets \(y_j\) of these elements, \(\mathbb {J}=\{y_j\}\). Each subset \(y_j\) has a cost \(c(y_j)\), and GSPP divides all elements into subsets with the minimum total cost. Formally,

Definition 6

(General Set Partitioning Problem) Let \(\lambda _j \in \{0,1\}\) denote whether subset \(y_j\) is a result subset. \(y_{i,j}\)=1 if \(y_j\) contains element i, and 0 otherwise. Then, GSPP finds \({{\,\mathrm{argmin}\,}}\sum _{y_j \in \mathbb {J}}{c(y_j)\lambda _j}\) subject to \(\sum _{y_j \in \mathbb {J}}{y_{i,j}\lambda _j} = 1, i=1,2,\ldots ,n\).

In TOGGLE, the universe has n labels denoted as l. Each permutation P partitions the labels into label groups, and each has \(r\) \(=\) \(n/m\) labels. The j-th label group \(\{l_{p_{r(j-1)+i}} { | i\in [1,r]}\}\) is a subset \(y_j\). The rule \(\mathbb {Y}\) is t-closeness, \(EMD_j(l,y_j)\) \(\le \) t. The cost is the search space, i.e., \(c(y_j)=(\sum _{i=1}^{n} F^l_G(y_{i,j}))^{2}\), where \(F^l_G(y_{i,j})\) is the probability of label i in \(y_j\). Specifically, if \(y_{i,j}=1\), \(F^l_G(y_{i,j}) = F^l_G(i)\), otherwise \(F^l_G(y_{i,j}) =0\).

Therefore, the optimization problem of TOGGLE can be reduced to GSPP as follows:

$$\begin{aligned} {{{\underline{\varvec{TOGGLE}}}}~ under~\mathbf{GSPP} }~~~\min \sum _{y_j \in \mathbb {J}}{\lambda _j\underbrace{c(y_j)}_{(\sum _{i=1}^{n}F^l_G(y_{i,j}))^{2}} } \end{aligned}$$
(4)

subject to \(\sum _{y_j \in \mathbb {J}}{y_{i,j}\lambda _j} = 1, i=1,2,\ldots ,n\), and \(\lambda _j \in \{0,1\}\). The original TOGGLE problem has \(\frac{n! }{((n/m)!)^{m}}\) permutations, whereas the GSPP reduction effectively shrinks the size \(|\mathbb {J}|\) to \(\frac{n! }{(n/m)!(n-n/m)!}\).

4.3 Sub-optimal TOGGLE for label generalization

Since the asymptotic size of GSPP is still exponential in n, finding the optimal partition is only feasible for small n (Footnote 5). In what follows, we propose a sub-optimal solution with a theoretical guarantee. In particular, we first transform the original GSPP optimization problem in Expression (4) into a Linear Programming Master (LPM) problem by relaxing the integer constraint. Then, an initial solution to the LPM problem is obtained, based on which an iterative process consisting of a master problem and a subproblem is carried out. The process terminates when the desired solution is derived.

Algorithm 2 Sub-optimal TOGGLE (pseudocode)

Algorithm 2 outlines the sub-optimal solution. Inspired by Column Generation [21], the algorithm first transforms the original GSPP optimization problem, denoted by \(\mathbf{OP} \), into LPM problem (Line 1) by relaxing the integer constraint \(\lambda _j \in \{0,1\}\) to \(\lambda _j \in [0,1]\).

$$\begin{aligned} {{{\underline{\textit{\textbf{LPM Problem}}}}}}~~~~~~~~~~~~~~~~~~~\min \sum _{y_j \in \mathbb {J}}{c_j \lambda _j} \end{aligned}$$

subject to \(\sum _{y_j \in \mathbb {J}}{y_{i,j}\lambda _j} = 1, i=1,\ldots ,n\) and \(\lambda _j \in [0,1]\), where \(c_j = c(y_j) = (\sum _{i=1}^{n} F^l_G(y_{i,j}))^{2}\) and \(y_j\) is a subset, also called a column.

Since it is impossible to obtain all \(|\mathbb {J}|\) feasible columns at once for the LPM problem, the algorithm first generates \(|\mathbb {J'}|\) (\(\le |\mathbb {J}|\)) feasible columns as an initial solution (Lines 2 to 8). With these columns, a master problem is formulated and solved by the Simplex method (Lines 12 to 13). The optimal solution is then used to formulate a dual subproblem, generate a new feasible column \(y^*_j\) (Lines 14 to 15) and append it to the resulting columns (Line 12). The algorithm then iteratively solves the master problem and the subproblem until there is no column with the desired reduced cost (Line 11). Finally, it applies the branch-and-bound algorithm [27] to obtain the integer solution to the partition of labels (Line 16).

In what follows, we elaborate on these steps and prove the theoretical error bound of this approximation algorithm.

4.3.1 Initial solution (Lines 2 to 8)

Let \(l=(l_1,l_2,\ldots ,l_n)\) be the n labels sorted by their values and \((1/n,1/n,\ldots ,1/n)\) be their distribution masses. Each feasible column \(y_j\) should contain n/m labels \((e_1,\ldots ,e_\alpha ,\ldots ,e_{n/m})\) with evenly distributed mass \((m/n,m/n,\ldots ,\) m/n). Thus, the initial solution should generate m feasible columns "aligned" with l in order to calculate the EMD. Here, "aligned" means that \(e_\alpha \) transports distribution mass to \(l_i\). We further define an Alignment Group as a group of consecutive \(l_i\) in l. If both l and the \(e_\alpha \) are sorted by their values, Lemma 2 gives their alignment relationship.

From Lemma 2, we can derive that if we select an element \(e_\alpha \) for \(y_j\) from one of the positions \([(\alpha -1)m+1,\alpha m]\) of l, the ground distance between \(e_\alpha \) and alignment group \(\alpha \) is at most \(\frac{n^2-n(n/m) }{2(n/m)^2(n-1)}\). Based on this lemma, we propose the following heuristic to obtain the initial solution. When \(t < \frac{m-1}{2(n-1)}\), we design a procedure ModifiedBasSol to find the exact optimal partition that satisfies t-closeness by exhaustive search (Line 7). The procedure is similar to the baseline solution in Algorithm 1, but it uses Lemma 2 to calculate EMD. When \(t \ge \frac{m-1}{2(n-1)}\), we group \(\{l_j\), \(l_{j+m}\), \(\ldots \), \(l_{j+(n/m-1)m}\}\) into a label group \(y'_j\) and further generate \(y_j\) based on \(y'_j\) by setting \(y_{i,j}=1\) if \(l_i\in y'_j\), otherwise \(y_{i,j}=0\) (Lines 2 to 6).
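The round-robin construction of the initial columns (Lines 2 to 6) is simple enough to state directly; the sketch below is ours and assumes the labels are already sorted by value.

```python
def initial_columns(sorted_labels, m):
    """Initial solution when t >= (m-1)/(2(n-1)): the j-th column groups labels
    l_j, l_{j+m}, ..., l_{j+(n/m-1)m} of the value-sorted label list."""
    n = len(sorted_labels)
    return [[sorted_labels[j + alpha * m] for alpha in range(n // m)]
            for j in range(m)]

# initial_columns([7000, 8000, 14000, 15000], 2) -> [[7000, 14000], [8000, 15000]]
```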

4.3.2 Master problem (Lines 12 to 13)

With the generated \(|\mathbb {J'}|\) columns, we further reformulate the LPM problem to a Restricted Linear Programming Master (RLPM) problem and obtain an optimal solution to it and its dual problem (Lines 12 and 13).

$$\begin{aligned} {{{\underline{\varvec{RLPM Problem}}}}}\quad \min \sum _{y_j \in \mathbb {J'}}{c_j \lambda _j} \end{aligned}$$

subject to \(\sum _{y_j \in \mathbb {J'}}{y_{i,j}\lambda _j} = 1, i=1,\ldots ,n\) and \(\lambda _j \in [0,1]\), where \(c_j = c(y_j) = (\sum _{i=1}^{n} F^l_G(y_{i,j}))^{2}\) and \(y_j\) is a column generated from the rule \(\mathbb {Y}\). Note that \(y_j\) is limited to \(\mathbb {J'}\) instead of \(\mathbb {J}\).

The RLPM problem is a simple linear programming problem with only \(|\mathbb {J'}|\) feasible columns, so it can easily be solved by the Simplex method (Line 13).
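As a concrete illustration of Lines 12 and 13, the restricted master problem is an ordinary LP and can be handed to any LP solver. The sketch below is ours and uses SciPy's linprog; reading the dual values from res.eqlin.marginals relies on the HiGHS backend of recent SciPy releases and is stated here as an assumption, not as part of the paper.

```python
import numpy as np
from scipy.optimize import linprog

def solve_rlpm(columns, label_freq):
    """Restricted LP master: minimize sum_j c_j * lambda_j subject to
    Y lambda = 1 and 0 <= lambda_j <= 1.  columns[j] lists the label indices
    covered by column j; c_j = (sum of their in-graph frequencies)^2."""
    n = len(label_freq)
    c = np.array([sum(label_freq[i] for i in col) ** 2 for col in columns])
    Y = np.zeros((n, len(columns)))                      # coverage matrix y_{i,j}
    for j, col in enumerate(columns):
        for i in col:
            Y[i, j] = 1.0
    res = linprog(c, A_eq=Y, b_eq=np.ones(n), bounds=(0, 1), method="highs")
    duals = res.eqlin.marginals                          # mu for the pricing subproblem
    return res.x, duals
```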

4.3.3 Subproblem (Lines 14 to 15)

Let \(\mu \) denote the dual optimal solution to the RLPM problem. The subproblem is to find a new feasible column \(y^*_j\) that minimizes the reduced cost \(\{c(y_j) - \mu y_j\}\) and satisfies the rule \(\mathbb {Y}\), or formally

$$\begin{aligned} y^*_j = \mathop {{{\,\mathrm{argmax}\,}}}\limits _{y_j} \{\mu y_j - c(y_j) \} \end{aligned}$$

subject to \(EMD(l,y_j)\le t\), where \(c(y_j)\) \(=\) \((\sum _{i=1}^{n}F^l_G(y_{i,j}))^{2}\).

The subproblem is obviously NP-hard. We propose a sub-optimal solution by reducing the subproblem to a \(0-1\) Quadratic Knapsack Problem (QKP) (Line 14):

$$\begin{aligned} {{{\underline{\varvec{0 -- 1 QKP}}}}}~~~~~~~~~~~~~~~~~~~\max \{\mu y_j - c(y_j)\} \end{aligned}$$

subject to

$$\begin{aligned} {{{\underline{\varvec{QKP Constraints}}}}}~~~\sum _{\beta =1}^{m}y_{(\alpha -1)m+\beta ,j}=1,~~\alpha =1,\ldots ,\frac{n}{m} \end{aligned}$$

The \(0-1\) QKP can be solved by the CPLEX optimizer [28] as a quadratic programming problem (Line 15).

4.3.4 Analysis on TOGGLE

In this subsection, we give a detailed theoretical analysis for correctness, performance guarantee and complexity.

Correctness analysis We can theoretically prove that the generalized labels generated by TOGGLE satisfy t-closeness. To this end, we need to prove that any generalized label in the initial solution (i.e., \(y_j\)) and that in the subproblem (i.e., \(y^*_j\)) achieve t-closeness.

Theorem 2

When \(t \ge \frac{m-1}{2(n-1)}\), the column \(y_j\) or \(y^*_j\) generated as above satisfies t-closeness, where n and m are the numbers of labels and label groups, respectively.

To sketch this proof, we first introduce two lemmas. In particular, Lemma 1 paves the way for the proof of Lemma 2 and the latter proves the range of the ground distance. Their proofs are given in Appendix A.

Lemma 1

The \(\alpha \)-th element \(e_\alpha \) of \(y_j\) is aligned with \(\alpha \)-th alignment group of l, and \(e_\alpha \) needs to transport 1/n distribution mass to each element in alignment group \(\alpha \).

Lemma 2

The ground distance between \(e_\alpha \) and the alignment group \(\alpha \) (denoted by \(g_\alpha \)) is

$$\begin{aligned} Dist(e_\alpha ,g_\alpha )=\frac{1}{n-1}\sum _{\beta =1}^{m}\Big | i-(\alpha -1)m - \beta \Big |. \end{aligned}$$

For \(\forall i \in [(\alpha -1)m+1, (\alpha -1)m+m ]\), if m is odd, \(Dist(e_\alpha ,g_\alpha ) \in [\frac{n^2 - \frac{n}{m}^2}{4(n-1)\frac{n}{m}^2}\), \(\frac{n^2 - \frac{n^2}{m}}{2(n-1)\frac{n}{m}^2}]\). Otherwise, \(Dist(e_\alpha \), \(g_\alpha ) \in [\frac{n^2}{4(n-1)\frac{n}{m}^2}\), \(\frac{n^2 - \frac{n^2}{m}}{2(n-1)\frac{n}{m}^2}]\).

Performance guarantee We can also prove that the (sub-)optimal TOGGLE has a theoretical guarantee. In particular, if \(t < \frac{m-1}{2(n-1)}\), (sub-)optimal TOGGLE can find the exact solution. Otherwise, it can provide a \((1+m/5)\)-near optimal solution.

Theorem 3

When \(t \ge \frac{m-1}{2(n-1)}\), sub-optimal TOGGLE provides a \((1+\epsilon )\)-approximate solution to the original problem, where \(\epsilon \) is approximately m/5.

To sketch this proof, we introduce two lemmas. The first gives an approximation bound to the initial solution, and the second gives an approximation bound to the subproblem.

Lemma 3

When \(t \ge \frac{m-1}{2(n-1)}\), the initial solution has an approximation ratio \(\big (1+\frac{m^2-m+2}{(m+1)(ln(n)+0.5772+1/(2n))}\big )\) to the optimal solution.

Lemma 4

When \(t \ge \frac{m-1}{2(n-1)}\), the reduction to \(0-1\) QKP provides a \(\frac{4}{(ln(n)+0.5772+1/(2n))m}\)-approximate solution to the original subproblem.

Complexity analysis We can also show that the space complexity is linear in the number of labels n, and that the worst-case time complexity is proportional to \(\frac{n! }{(n/m)!(n-n/m)!}\).

In particular, the space complexity is O(n). As for the time complexity, if \(t \ge \frac{m-1}{2(n-1)}\), let \(\omega \), \(t_1\) and \(t_2\) denote the number of iterations, the time cost of one Simplex invocation and the time cost of one QKP solver call, respectively. Then, the total time complexity of TOGGLE is \(O(\omega (t_1+t_2))\). Otherwise, the time complexity is \(\frac{n!t_3 }{(n/m)!(n-n/m)!}\), where \(t_3\) is the time cost to calculate the EMD using Lemma 2. Note that the time complexity of (sub-)optimal TOGGLE is far less than that of the baseline solution (i.e., \(O(\frac{n!t_1}{((n/m)!)^{m}})\), see Section 3), which is exponential in n. Moreover, the value of \(\frac{n!}{(n/m)!(n-n/m)!}\) is relatively small, as n is the number of labels in a single attribute, which is small in real-world datasets (see Section 6).

5 PGP subgraph matching algorithm

Although subgraph matching has been intensively studied [1, 29, 30], the only algorithm that can work on an outsourced succinct graph \({\widetilde{G}}^k\) is the star-based algorithm [3] described in Section 3. However, this algorithm needs to decompose the original query into multiple sub-queries, which is significantly inefficient. In this section, we propose the partial-graph-based subgraph processing algorithm PGP, which no longer needs query decomposition. In the outsourced graph \({\widetilde{G}}^k\), boundary nodes are the "1-hop neighboring nodes" and interior nodes are the others. The challenge in processing a query on \({\widetilde{G}}^k\) is to retrieve the neighbors of boundary nodes, whose neighbors are only partially present in the succinct graph. To address this problem, PGP exploits the symmetry of the outsourced graph to retrieve matchings for those nodes. Its pseudo-code is shown in Algorithm 3. Note that T and \(T_p\) denote the set of final matched subgraphs and the partial embedding of the matching currently under construction, respectively.

Algorithm 3 PGP subgraph matching (pseudocode)

Algorithm 3 first initializes T and \(T_p\) as empty sets and marks all data nodes in \({\widetilde{G}}^k\) as unvisited (Line 2). Then, matching candidates are generated for each query node of \({\widetilde{Q}}\) by comparing their labels, degrees and neighbors' labels with those of the interior nodes in \({\widetilde{G}}^k\), and expanding them to \(G^k\) (Line 3). Specifically, for each query node u of \({\widetilde{Q}}\), an interior node v is selected as a candidate if all three conditions are met: (1) v has the same vertex type and label group as u, (2) u's degree is no more than that of v, and (3) the labels of u's neighbors are a subset of those of v's neighbors. Next, the algorithm expands the candidates to the other blocks of \(G^k\) and forms the candidate set. After that, it generates a sorted list \(L_q\) that stores the access order of the query nodes of \({\widetilde{Q}}\) according to their selectivity, i.e., the ratio between the number of candidate nodes and the degree (Line 4). Finally, it triggers the subroutine PartialSQ to retrieve all matched subgraphs.

This subroutine first determines whether the partial embedding \(T_p\) is a complete matching by comparing the size of \(T_p\) with that of \({\widetilde{Q}}\). If so, \(T_p\) is inserted into T (Lines 7 to 8). Then, it recursively processes the query (Lines 10 to 17) as follows. It finds the query node u at depth d in \(L_q\) together with its father node \(u_f\) and refines the candidates for matching with u by calling the subroutine refineV (Lines 10 to 11). refineV confines u's candidates v to the neighbors of \(v_f\), the vertex already matched with \(u_f\). Specifically, if \(v_f\) is an interior node, v is selected as a candidate vertex when v is one of the neighbors of \(v_f\). If \(v_f\) is a boundary node, there are three steps: (1) finding the symmetric vertex \(v_f'\) of \(v_f\) among the interior nodes, (2) retrieving all neighbors of \(v_f'\), and (3) determining whether v can be mapped to one of those neighbors by the automorphic function (if so, v is a candidate vertex of u). Then, for each candidate \(v \in C_s\), the subroutine determines whether v can be matched with u in the current partial embedding \(T_p\) by checking whether all vertices paired with u's neighbors in \(T_p\) are also neighbors of v (Lines 12 to 13). If v can be matched, the pair (u, v) is inserted into \(T_p\) (Line 14). Then, PartialSQ is recursively called to access the next vertex of \({\widetilde{Q}}\) (Line 15).
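To convey the control flow of PartialSQ, we give a simplified, self-contained sketch below (ours, not the authors' implementation). It assumes \(k=2\), so the automorphic function is an involution given by the dictionary auto, restricts all candidates to vertices of the succinct graph, and replaces the father-node bookkeeping by simply using the first already-matched query neighbor; the data-structure names (order, q_adj, g_adj, candidates, interior) are illustrative.

```python
def pgp_match(order, q_adj, candidates, g_adj, interior, auto):
    """Simplified PartialSQ backtracking.
    order        : query vertices in selectivity order (Line 4)
    q_adj, g_adj : adjacency dicts of Q~ and of the succinct graph G~^k
    candidates   : candidates[u] = initial candidate set of query vertex u (Line 3)
    interior     : interior nodes of G~^k; auto[v] = symmetric counterpart of v."""
    results, T_p = [], {}

    def data_neighbors(v):
        # Interior nodes keep their full neighbor list; a boundary node borrows
        # the neighbors of its symmetric interior node and maps them back (refineV).
        if v in interior:
            return g_adj[v]
        return {auto[w] for w in g_adj[auto[v]]}

    def partial_sq(depth):
        if depth == len(order):                       # complete matching (Lines 7-8)
            results.append(dict(T_p))
            return
        u = order[depth]
        matched = [T_p[x] for x in q_adj[u] if x in T_p]
        # Restrict u's candidates to neighbors of an already matched vertex (Line 11).
        pool = candidates[u] if not matched else candidates[u] & data_neighbors(matched[0])
        for v in pool:
            if v in T_p.values():                     # keep the embedding injective
                continue
            if all(w in data_neighbors(v) for w in matched):   # Lines 12-13
                T_p[u] = v                            # Line 14
                partial_sq(depth + 1)                 # Line 15
                del T_p[u]

    partial_sq(0)
    return results
```

The key point mirrors Lines 10 to 15 of Algorithm 3: interior nodes expose their stored neighbor lists directly, while a boundary node borrows the neighbor list of its symmetric interior node and maps it back through the automorphic function, so no query decomposition is needed.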

Utility analysis As for utility, two metrics should be considered. One is recall, i.e., the fraction of all true results for the query that are retrieved. Fortunately, the matchings retrieved by PGP cover all true matchings of Q over G, as the (k, t)-privacy model does not drop any edge or node. As such, the recall is always 100%. The other is precision, i.e., the fraction of correct results among the retrieved matchings. The precision of the matchings retrieved by PGP is the same as that of the baseline solution.

Complexity analysis The space complexity is O(|T|) where |T| is the number of matchings. The time complexity is \(O(|C_s|^{|V({\widetilde{Q}})|})\) where \(V({\widetilde{Q}})\) denotes the set of vertices of query \({\widetilde{Q}}\).

6 Experimental results

6.1 Experimental setup

We evaluate the performance of the proposed TOGGLE (denoted by TOG) and PGP algorithms against the baseline solution for label generalization (denoted by BAS) and the star-based subgraph processing algorithm (denoted by SGP) described in Section 3 (code kindly provided by the authors of [3]). To evaluate privacy and utility, we further compare our methods with state-of-the-art techniques. All algorithms are written in C++ except for TOGGLE, which is implemented in MATLAB. The client is a desktop computer with an Intel(R) Core(TM) i5-6600 CPU and 16GB RAM running Windows 10 Pro. The cloud server is a Windows Server 2016 Datacenter machine with 4 CPU cores and 64GB RAM.

Datasets We use three real graph datasets of increasing size: Web-NotreDame, DBpedia and UK-2002. Their statistics are given in Table 2, where \(N_t\), \(N_a\) and \(N_l\) denote the numbers of types, attributes and labels, respectively. In particular, the maximum numbers of labels in a single attribute are 200, 985 and 1000, respectively. Query graphs are generated from the data graph by the random walk scheme [31]. For each dataset, 1000 query graphs with sizes in the range [4, 30] are randomly generated.

Table 2 Datasets
Table 3 Parameter settings for experiments

Performance metrics In each experiment, we measure the Time Cost, Model Cost and Approximation ratio. Time Cost is the clock time to complete the experiment. Model Cost is the value of the objective function (Expression (4)) for a label generalization algorithm. Approximation ratio is the ratio of Model Cost of a sub-optimal solution to that of the optimal solution.

Parameter settings Unless otherwise stated, we use the default parameter values in Table 3. We choose 4 values \(t_i,i\in \{1,2,3,4\}\) for t in order to cover all possible cases of approximation ratio as stated in the proof of Theorem 2.

6.2 Performance of sub-optimal TOGGLE

We first compare the performance of TOGGLE with that of BAS under the default settings. Since BAS is very time-consuming, we use only 12 labels in the default settings. Figure 5a shows the time cost on UK-2002 with 12 labels (denoted by UK-2002(12)). From this figure, we find that the time cost of TOG is far less than that of BAS regardless of t. In addition, we observe a dramatic increase in the time cost of BAS as t increases from \(t_1\) to \(t_4\). This is because the number of label groups satisfying t-closeness increases when t is relaxed from \(t_1\) to \(t_4\). From Fig. 5b, we observe that the model cost of TOG is the same as that of BAS, which indicates that the approximation ratio of TOG on UK-2002(12) can reach 1. Figure 6 shows similar results on the DBpedia(12) dataset. This shows that TOGGLE not only has a theoretical approximation bound, but is also close to optimal in practice. In what follows, we further evaluate its performance by varying the group size and the number of labels.

Fig. 5 Time Cost and Model Cost on UK-2002(12)

Fig. 6 Time Cost and Model Cost on DBpedia(12)

Impacts of group size Figure 7 shows the time cost on UK-2002(12) when varying the group size. In this figure, we use \(TOG_1\) to denote the time cost of TOG when \(t=t_1\); for \(t=t_2\), \(t_3\) or \(t_4\), we use TOG as it behaves the same. Similarly, \(BAS_i\) denotes the time cost of BAS when \(t=t_i\). We observe that the time cost of TOG is far less than that of BAS, especially for \(t=t_3\) and \(t_4\). When the group size increases, the time of BAS decreases because its search space is proportional to the number of different permutations. On the other hand, TOG fluctuates within a narrow range of under 1 second.

Fig. 7 Time Cost vs. Group Size on UK-2002(12)

Fig. 8 Freq on UK-2002(12) and DBpedia(12)

Figure 9 presents the corresponding model costs on UK-2002(12). There is one missing result at \((6,t_1)\) because there is no feasible label group in this setting. It can be seen from this figure that increasing the group size leads to an increase in model cost. Nonetheless, the model cost of TOG is almost identical to that of \(BAS_i\).

Fig. 9 Model Cost vs. Group Size on UK-2002(12)

We also plot the frequency distribution of the approximation ratio (denoted by Freq) for various group sizes and t settings on both datasets in Fig. 8. Overall, even the worst case yields an approximation ratio lower than 1.1, and more than \(50\%\) of the cases achieve a ratio of exactly 1.

Fig. 10 Time Cost vs. Number of Labels on UK-2002

Fig. 11 Freq on UK-2002 and DBpedia

Impacts of label size We vary the number of labels from 10 to 1000 (i.e., the maximum number of labels in a single attribute) on UK-2002 and compare the results. As shown in Fig. 10, the time cost of BAS rises exponentially with the number of labels, and it already becomes prohibitively costly (e.g., more than 1.5 days) for only 16 labels at \(t \ge t_3\). On the other hand, TOG is much less sensitive to the increase in the number of labels. In fact, the time cost of TOG peaks at 400 labels and slowly decreases afterward, which indicates that TOG can easily scale to even more labels. Figure 12 presents the corresponding model costs on UK-2002. It can be seen from this figure that the model cost of TOG is almost equivalent to that of BAS. This shows that TOG achieves robust and desirable performance irrespective of the label size.

Fig. 12 Model Cost vs. Number of Labels on UK-2002

As for the approximation ratio, we plot the frequency distribution of approximation ratios (denoted by Freq) of TOG for both UK-2002 and DBpedia in Fig. 11. We find the approximation ratios are no more than 1.1 and \(80\%\) of them are exactly 1.

To sum up, our privacy-preserving label generalization algorithm TOGGLE achieves high efficiency, effectiveness and scalability.

6.3 Performance of PGP algorithm

We compare our PGP algorithm with SGP, the star-based subgraph processing algorithm. Figure 13a presents the time cost for the Web-NotreDame dataset. We observe that as the query size increases, the time cost of PGP grows much more slowly than that of SGP. This is because the number of star matchings undergoes a huge rise, as shown in Table 4. We then fix the query size to 6 and vary the value of k from 2 to 6. As shown in Fig. 13b, the time cost increases with k, because a larger k introduces more redundant edges into the k-automorphic data graph, which leads to more matchings for a single query. Nonetheless, PGP takes less time than SGP, since the latter incurs significant time computing the natural join of a large number of matched stars. Figure 13c through f shows similar results for the other two datasets. As such, we conclude that PGP always outperforms SGP under various k and query sizes. Furthermore, the gain of PGP becomes more evident for larger k or query sizes.

Table 4 # of Star matchings on Web-NotreDame (k=2)
Fig. 13 Time Cost vs. |V(Q)| and Time Cost vs. k

Fig. 14 Degree frequencies

Fig. 15 Number of matchings (k=6)

Table 5 Success ratio under similarity attack and majority attack

6.4 Performance of privacy–utility

We further compare our work with some existing techniques in terms of privacy and utility on Web-NotreDame.

Power of privacy protection We compare the anonymized graphs produced by our work (TOG) and by other classical anonymization methods [3, 13, 15] under different attacks such as the degree attack, subgraph attack and similarity attack. For each vertex degree d in the anonymized graph produced by our work, we report the number of vertices with degree d. As depicted in Fig. 14, the minimal degree frequency is k, which indicates that our method guarantees privacy under the degree attack. We further test our method under subgraph attacks. To this end, we randomly extract subgraphs from the original graph as query graphs and retrieve the matchings of each query to check whether the number of matchings is at least k. As shown in Fig. 15, the number of matchings of each subgraph query over the anonymized graph produced by our work is at least k, which indicates that the anonymized graph can defend against the subgraph attack. In contrast, Deg [13] and Nei [15] suffer from the subgraph attack, because each of these two algorithms considers only a single type of attack, i.e., the degree attack and the 1-neighbor-graph attack, respectively. We further compare TOG with BAS [3] under the similarity attack and the majority attack (i.e., predicting the target node's label as the majority label of its neighbors). Table 5 presents the success ratios under the similarity attack and the majority attack (denoted by SucRatio(SimAttack) and SucRatio(MajAttack)), i.e., the ratios at which an attacker can infer the sensitive label using the corresponding attack technique. The success ratio of the similarity attack on the anonymized graph produced by TOG is 0, while that on the anonymized graph produced by BAS is close to 50%. SucRatio(MajAttack) of both TOG and BAS is close to 1.0%, which equals the probability of randomly selecting one of the 100 labels as the label of a target node. Therefore, we conclude that TOG guarantees privacy under the degree attack, subgraph attack and similarity attack.
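For clarity, the majority attack used in this comparison can be sketched as follows; this is an illustrative Python snippet under our own data representation, not the exact evaluation code. The attacker predicts a target node's sensitive label as the most frequent label among its neighbors, and the success ratio is the fraction of targets for which this prediction is correct.

```python
from collections import Counter

def majority_attack_success_ratio(adj, labels):
    """adj: node -> iterable of neighbor nodes; labels: node -> sensitive label."""
    targets = [v for v in labels if adj.get(v)]
    hits = 0
    for v in targets:
        # Predict the target's label as the majority label of its neighbors.
        predicted, _ = Counter(labels[u] for u in adj[v]).most_common(1)[0]
        hits += (predicted == labels[v])
    return hits / len(targets) if targets else 0.0
```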

Fig. 16 Degree frequencies on ERNet

Fig. 17 Degree frequencies on SFNet

Fig. 18 Number of matchings on ERNet (k=6)

Fig. 19 Number of matchings on SFNet (k=6)

Table 6 Success ratio under similarity attack and majority attack on ERNet

In addition, we evaluate our methods on synthetic datasets and compare them with existing state-of-the-art techniques. We use the Pajek software (http://vlado.fmf.unilj.si/pub/networks/pajek/) to generate two kinds of random graphs, an Erdos–Renyi graph and a scale-free graph.

1. Erdos–Renyi Graph (denoted by ERNet): This graph is generated by a random graph model that defines a random graph as N vertices connected by M edges chosen randomly from the \(N(N-1)\) possible edges. Pajek generates such a graph given the number of vertices N and the average degree d. In our experiments, we set \(N=1000\) and \(d=10\).

2. Scale-Free Graph (denoted by SFNet): A scale-free network is a network whose degree distribution follows, at least asymptotically, a power law. In our experiments, we set the number of vertices to 1049. (A sketch of generating both synthetic graphs with a general-purpose graph library is given after this list.)
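As an aside, the two synthetic graphs can also be reproduced with a general-purpose graph library. The following sketch uses networkx rather than Pajek; the attachment parameter of the scale-free generator is our own choice, since the paper only reports the vertex count.

```python
import networkx as nx

# ERNet: N = 1000 vertices with average degree d = 10,
# i.e., M = N * d / 2 = 5000 undirected edges chosen at random.
er_net = nx.gnm_random_graph(n=1000, m=5000, seed=42)

# SFNet: 1049 vertices with a power-law (scale-free) degree distribution.
# The number of edges attached per new vertex (2) is an assumed parameter.
sf_net = nx.barabasi_albert_graph(n=1049, m=2, seed=42)
```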

As depicted in Figs. 16 and 17, for each vertex degree d in the anonymized graph produced by our method, the number of vertices with degree d is at least k, which indicates that our method can defend against the degree attack. We further test our method under subgraph attacks by retrieving the matchings of each query. As presented in Figs. 18 and 19, the number of matchings of each subgraph query over the anonymized graph produced by our work is at least k, which indicates that the anonymized graph can defend against the subgraph attack. As before, both Deg [13] and Nei [15] suffer from the subgraph attack. We further compare TOG with BAS [3] under the similarity attack and the majority attack on the synthetic dataset ERNet. Table 6 presents the corresponding success ratios (denoted by SucRatio(SimAttack) and SucRatio(MajAttack)). The success ratio of the similarity attack on the anonymized graph produced by TOG is 0, while that on the anonymized graph produced by BAS is close to 50%. SucRatio(MajAttack) of both TOG and BAS is close to 1.0%, which equals the probability of randomly selecting one of the 100 labels as the label of a target node. Therefore, we conclude that TOG also guarantees privacy under the degree attack, subgraph attack and similarity attack on the synthetic graphs.

Table 7 Running time and storage space (PGP vs. SQ)

Utility evaluation To evaluate the utility of our method (PGP), we further compare it with classical subgraph matching methods over the entire graph. Since both \({\widetilde{G}}^k\) and the Alignment Vertex Table (AVT) are outsourced to the cloud during pre-processing, the cloud effectively knows all the information of the original \(k\)-automorphic graph \(G^k\). Hence, most subgraph matching algorithms [31,32,33] can be extended to work on the outsourced graph \({\widetilde{G}}^k\) by the following steps. (1) Graph extension: given a vertex p of \({\widetilde{G}}^k\), its symmetric vertices P in the other blocks can be found by a lookup in the AVT; the symmetric vertices Q of another vertex q of \({\widetilde{G}}^k\) can be found in the same way, and if there is an edge between p and q, an edge is inserted between the corresponding vertex pairs \(p'\) and \(q'\) with \(p' \in P\) and \(q' \in Q\). (2) Index construction: the indices for degree and neighborhood-signature filtering of candidates, as in previous works such as [26], are then constructed. (3) Subgraph matching: given the extended graph of \({\widetilde{G}}^k\) and the indices, classical subgraph matching algorithms such as [31,32,33] are applied to retrieve the subgraph matchings. The results of PGP and those of the classical subgraph matching algorithm [31] (denoted by SQ) on Web-NotreDame are shown in Table 7. Under the same privacy level (the same k), PGP outperforms SQ w.r.t. both running time and storage space; in particular, the time and space savings are up to tenfold and fivefold, respectively.
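A minimal sketch of the graph-extension step (step 1) is given below. It assumes the AVT is available as a dictionary mapping each vertex to the list of its symmetric copies, one per block, and it pairs the i-th copy of p with the i-th copy of q following the block alignment implied by the k-automorphism; all names and the data representation are illustrative rather than the paper's exact implementation.

```python
def extend_graph(edges, avt):
    """Reconstruct the k-automorphic edge set from the outsourced edges and
    the Alignment Vertex Table (AVT).

    edges: iterable of (p, q) pairs of the outsourced graph.
    avt:   dict mapping a vertex to the list of its symmetric vertices,
           one entry per block (the vertex itself included).
    """
    extended = set()
    for p, q in edges:
        # The i-th symmetric copy of p is connected to the i-th copy of q.
        for p_sym, q_sym in zip(avt.get(p, [p]), avt.get(q, [q])):
            extended.add((min(p_sym, q_sym), max(p_sym, q_sym)))  # undirected
    return extended
```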

Table 8 Recall
Fig. 20 Precision

We further evaluate the recall and precision of PGP, SQ and SGP. Recall that recall is the fraction of correct results among all true results of the query, and precision is the fraction of correct results among the retrieved matchings. As reported in Table 8, recall is always 100% since no edges or nodes are dropped in the generalization process. As for precision, the three methods achieve the same values because each of them can be regarded as subgraph processing on a \(k\)-automorphic graph \(G^k\) (Fig. 20). Moreover, precision decreases as k increases, since a larger k introduces more dummy edges. These observations further justify the utility analysis in Section 5 (Utility Analysis): a larger k yields more stringent privacy, but also more running time and storage space for subgraph processing and lower precision.
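These two metrics can be computed directly from the retrieved and true matching sets; the following tiny sketch, with an assumed set-based representation of matchings, simply makes the definitions precise.

```python
def recall_and_precision(retrieved, truth):
    """retrieved, truth: sets of matchings, e.g. frozensets of
    (query_vertex, data_vertex) pairs."""
    correct = retrieved & truth
    recall = len(correct) / len(truth) if truth else 1.0
    precision = len(correct) / len(retrieved) if retrieved else 1.0
    return recall, precision
```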

Table 9 Summary on related work

7 Related work

The work most germane to this research falls into two categories: privacy-preserving graph data publication and anonymization, and privacy-preserving graph query processing in the cloud; both are summarized in Table 9.

Privacy-preserving graph data publication and anonymization Many structural privacy-preserving mechanisms have been developed [17,18,19] that exploit the symmetry of the graph data to guard against various attacks such as the degree attack, subgraph attack and hub-fingerprint attack [13, 15, 16]. In particular, the k-automorphism method by Zou et al. constructs \((k-1)\) symmetric vertices for each existing vertex and is claimed to defend against any structural attack [17]. When sensitive labels are considered, k-anonymity and \(\ell \)-diversity have been adopted for graphs [3, 12], and these approaches can also preserve structural privacy. However, [12] can only defend against specific structural attacks (e.g., the degree attack) or label attacks on simple labels, and both are vulnerable to advanced attacks such as the similarity attack. Differential privacy (DP) [11], a more stringent privacy model, has been widely studied to protect statistical databases against privacy disclosure; it ensures that query results on a dataset are insensitive to the change of a single record. Owing to its unique strengths, it has been applied to graph data analysis, where the methods can be grouped into two categories: edge-DP-based [34,35,36,37] and node-DP-based [39,40,41,42] methods. In edge-DP (resp. node-DP), two graphs are neighboring if they differ in a single edge (resp. node). In particular, [34] publishes the degree sequence of a graph under edge-DP by adopting the Laplace mechanism, where the sensitivity is 2. [35, 36] also consider publishing the degree sequence or distribution by extending the technique to synthetic graph generation. [37] proposes a framework for graph metric estimation under local differential privacy. When the setting moves to node-DP, the sensitivity of the degree distribution becomes \(2(|V|-1)\), since removing a node may affect the other \(|V|-1\) nodes. Hence, [42] explores a projection approach to reduce the sensitivity when publishing the degree distribution of a graph under node-DP, while [41] adopts a truncation approach. Instead of publishing degree sequences or distributions, differentially private subgraph counting queries under node-DP have also been studied [40, 42]; for example, [42] develops a novel method for differentially private triangle counting in large graphs. Recent efforts have investigated publishing statistics of attributed graphs or spatiotemporal graphs [38, 43, 44]. The neighboring data in [44] are records that differ in the presence of a single edge or in the attribute vector associated with a single node, whereas neighboring data in [43] are defined on infinite series data. Since the perturbation is required to satisfy differential privacy, noise such as Laplace noise [34, 43] needs to be injected into the graphs or their statistics, which makes DP-based techniques insufficient or even infeasible for subgraph matching, where exact matchings are desirable.
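For illustration only, publishing a degree sequence under edge-DP with the Laplace mechanism, in the spirit of [34], can be sketched as follows (function and parameter names are our own): since adding or removing one edge changes two degrees by one each, the L1 sensitivity of the degree sequence is 2, so Laplace noise with scale \(2/\epsilon \) suffices.

```python
import numpy as np

def private_degree_sequence(degrees, epsilon, seed=0):
    """Release a degree sequence under epsilon-edge-DP via the Laplace mechanism."""
    rng = np.random.default_rng(seed)
    # L1 sensitivity of the degree sequence under edge-DP is 2.
    noise = rng.laplace(loc=0.0, scale=2.0 / epsilon, size=len(degrees))
    return np.asarray(degrees, dtype=float) + noise
```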

There are many other graph data anonymization and publication techniques [16, 45,46,47]. For example, to protect a graph from link re-identification, Zheleva & Getoor propose five privacy preservation strategies that vary in the amount of data removed and the amount of privacy preserved [45], while Campan & Truta mask the graph data according to the k-anonymity model, in terms of both node attributes and node-associated structural information [46]. However, these techniques do not apply to our case since (1) some of them cannot protect both structural and label privacy (e.g., [45] only protects the graph from link re-identification), and (2) some of them need to delete edges from the original data graph [45, 46]. Hence, they are infeasible for subgraph matching, where exact matchings are desirable.

Privacy-preserving graph query processing in the cloud Many privacy-preserving methods and frameworks have been developed for diverse graph queries. In particular, Fan et al. [48] propose an asymmetric structure-preserving subgraph query processing method, the first practical private approach for subgraph query services, in which the data graph is publicly known and the query structure/topology is kept secret. [50] develops specific schemes for shortest path queries that achieve much better performance and scalability with a strong privacy property in practical scenarios. [49] proposes a method to efficiently compute the shortest distance in large outsourced graphs without compromising their sensitive information. Ma et al. [51] investigate shortest path sketches by defining a top-k critical vertices query on the shortest path. A novel graph encryption scheme that enables approximate constrained shortest distance querying is studied in [52]. However, all of these are privacy-preserving schemes for (approximate) shortest path queries or top-k critical vertices queries on shortest paths, and they do not apply to our case, where the goal is to find subgraph matchings on a single large attributed graph. [4] presents a method to answer subgraph matching over a set of encrypted small graphs instead of a single large graph. [53] develops a novel k-decomposition algorithm and a new information loss matrix designed for utility measurement in massively large graph datasets; however, it cannot protect label privacy since only structural privacy is considered. The works most germane to this research are [3, 59], which develop privacy-preserving subgraph matching methods on large attributed graphs in the cloud. Unfortunately, they are vulnerable to the similarity attack and suffer from low efficiency. Other techniques target different privacy-preserving graph queries, such as reachability queries [60] and kNN queries [61]; nevertheless, they cannot be adapted to subgraph matching on a single large attributed graph.

8 Conclusion

In this paper, we propose a graph label generalization algorithm and an efficient subgraph matching algorithm in the cloud with t-closeness and k-automorphism privacy. The former achieves a theoretically guaranteed approximation ratio of \((1+\epsilon )\), where \(\epsilon \) is approximately 0.2 times the number of label groups. The latter exploits the symmetry of the generalized graph and limits the search scope to a localized region. As for future work, we plan to extend this solution framework to a wide range of classic graph queries, such as maximal clique search and best region search.