1 Introduction

Attributed graphs, a subtype of graphs in which each node carries a set of attributes (as shown in Fig. 1a), have become increasingly popular for modeling web and social networks. Many graph queries have been developed to analyze and retrieve the rich semantic and structural information of these graphs. Among them, subgraph matching, which retrieves all subgraphs isomorphic to a given query graph, is fundamental to many graph data analytics tasks [1]. However, as graph data continue to grow in size, storing and processing them impose expensive upfront infrastructure costs on users. As such, many cloud service providers such as Amazon, Alibaba and Microsoft Azure offer graph outsourcing services by storing graphs owned by users and executing mining tasks on their behalf. GraphLab [2] even provides a graph-based Software-as-a-Service (SaaS).

Fig. 1 Original data graph G and query Q

However, the cloud server is considered "honest-but-curious" (a.k.a. semi-honest), which is consistent with most related works in the literature [3,4,5,6]. On the one hand, the cloud server acts in an "honest" fashion, i.e., it correctly follows the designated protocol specification (such as HIPAA compliance, Footnote 1) and always performs computations correctly without cheating. On the other hand, it is "curious" to infer sensitive information about the graph data by analyzing index structures, requested queries and query results. This assumption does not reduce the need for storing and processing graphs on an untrusted cloud provider. For example, Paysafe, a Fintech company, needs to store and analyze payment data including the attributes and interactions of customers and their payments (which are modeled as a large graph), but it is unwilling to incur the expensive upfront infrastructure costs. Hence, the company resorts to a graph-based cloud service, Graph Database and Graph Analytics (Footnote 2) provided by ORACLE [7], even though the cloud provider is assumed to be "honest-but-curious."

In general, cloud servers can learn the attribute values (a.k.a. labels) and structural information of the outsourced graphs. Privacy breaches on these graphs are known to disclose sensitive information [3], and they can be grouped into two categories: content disclosure and identity disclosure. Content disclosure compromises sensitive label information, such as the salary, social security number, or medical history of a user. To guard against content disclosure, three classic privacy-preserving techniques have been proposed, namely k-anonymity [8], \(\ell \)-diversity [9], and t-closeness [10]. They generalize several labels into a single one (also known as an equivalence class) to hide the sensitive information of a single label, so that the attacker can only see the generalized label. In particular, k-anonymity requires that each equivalence class contain at least k records so that each record is indistinguishable from at least \(k-1\) other records. \(\ell \)-diversity requires that the distribution of a sensitive attribute in each equivalence class have at least \(\ell \) "well-represented" values. t-closeness requires that the label distribution in each equivalence class be no more than t distance away from that in the whole set of labels. It is known that t-closeness can resist more attacks (such as the similarity attack) than k-anonymity and \(\ell \)-diversity and is the most stringent privacy metric among the three (Footnote 3). However, when it comes to protecting labels in attributed graphs, while [3] and [12] adopt k-anonymity and \(\ell \)-diversity, respectively, no existing work adopts t-closeness.

Identity disclosure [13, 14] compromises the location of a target node in the graph even after the node's identity has been removed. This disclosure can be caused by various structural attacks [13, 15, 16] such as the degree attack, 1-neighbor-graph attack, subgraph attack and hub-fingerprint attack. To guard against these attacks, many structure-privacy-preserving techniques [17,18,19] have been developed to enforce symmetry in an outsourced graph. In particular, [17] proposes the k-automorphism model. Given a graph G, it transforms G into a k-automorphic graph \(G^k\) by introducing noise edges and vertices so that each vertex has at least \((k - 1)\) other symmetric vertices. Hence, there are no structural differences between any vertex and its \((k - 1)\) symmetric vertices; in other words, the attacker cannot distinguish a vertex from its \((k - 1)\) symmetric vertices. It is known that the k-automorphism strategy can defend against any structure-based attack [17].

Example 1

Consider the graph G in Fig. 1a, where each vertex represents an entity. There are three entity types: individual (p), school (s) and company (c). The edges in G represent the relations between two entities, such as the "Work at," "Work together" and "Graduate from" relations. If an adversary knows that one person's node has degree 1 in the graph G, he can immediately identify node \(p_4\) as that person, and the related attributes of node \(p_4\) are revealed. The k-automorphism model can be applied to prevent such a structural attack by introducing noise edges to construct a k-automorphic graph \(G^k\). For example, \(G^k\) in Fig. 2a is a k-automorphic graph of G with \(k=2\); the noise edges are shown as black dashed lines. Note that each node contains a set of attributes, some of which (e.g., salary) are sensitive. In this case, a privacy model for structural privacy alone (e.g., k-automorphism) is not sufficient to protect the label privacy of each node. For example, even for the k-automorphic graph \(G^k\), if the labels are not anonymized, the adversary can use prior or background knowledge (e.g., the "Occupation" attribute in G, Fig. 1a) to identify the location of an individual. Suppose the adversary can narrow an individual down to node \(p_1\) or \(p_4\) of G in Fig. 1a; then he can conclude that this person's salary is high ("14,000" or "15,000") without exactly re-identifying the node. Therefore, when sensitive labels are considered, t-closeness should be adopted for graphs to generalize each label into a generalized one, as shown in Fig. 2a.

The example above shows the necessity of preserving both structural privacy and label privacy. However, apart from [3, 20], which support only k-anonymity, all existing works on attributed graphs consider either structural or label privacy. In this paper, we propose (k, t)-privacy, an integrated privacy model for outsourced graphs. Any graph that satisfies (k, t)-privacy must satisfy both k-automorphism for its structure and t-closeness for each generalized label of all node attributes. Intuitively, the (k, t)-privacy model preserves both structural privacy and label privacy.

Once the graph G is anonymized, a straightforward method is to upload the k-automorphic graph \(G^k\) to the cloud and perform subgraph queries over \(G^k\). However, this has some major limitations. First, many noise edges and vertices are introduced into the original graph G when constructing the k-automorphic graph \(G^k\). This results in a much larger graph on the cloud side, which leads to much higher storage and query costs. Although Chang et al. [3] propose to store only a succinct version of the k-automorphic graph, query decomposition is adopted during subgraph matching, which causes an extra result-joining step and incurs a large number of false positive results. Second, when imposing t-closeness on the labels of a graph G, different outputs of label generalization affect not only the privacy strength but also the search space (i.e., the number of vertices or matchings to explore for a query; see Section 4). Intuitively, a larger search space causes higher query cost. The second drawback is illustrated with the following example.

Example 2

Consider the attribute "Expected Salary" of the graph G in Fig. 1a, whose original set of labels is \(l=(7000\), 8000, 14000, 15000). There are two possible label generalizations. The first generalizes (7000, 8000) into a generalized label "A" and (14000, 15000) into another generalized label "B." In this case, given the query Q in Fig. 1b and its anonymized form, there are two matchings, \((c_1, p_1, p_2, s_1)\) and \((c_2, p_3, p_4, s_2)\). The second generalizes (7000, 15000) into "A" and (8000, 14000) into "B." In this case, there is only one matching, \((c_1, p_1, p_2, s_1)\). Obviously, the first generalization leads to a false positive matching (i.e., \((c_2, p_3, p_4, s_2)\)).

To address the subgraph matching efficiency problem caused by the enlarged graph size under k-automorphism and the enlarged search space under t-closeness-based label generalization, we model the cost of subgraph matching and reduce the problem of optimal label generalization for t-closeness to the General Set Partitioning Problem [21]. Based on this, we propose an efficient approximation algorithm TOGGLE with a bounded approximation ratio of \((1+\epsilon )\). Furthermore, as the cloud only stores a succinct version of the k-automorphic graph, query decomposition is adopted during subgraph matching [3], which causes an extra result-joining step and incurs a large number of false positive results. We therefore design a new subgraph matching algorithm PGP that works directly on such graphs without decomposition. To summarize, our main contributions are as follows:

(i) We propose (k, t)-privacy for outsourced graphs, which is, to the best of our knowledge, the most stringent generalization-based privacy model covering both graph structure and node labels.

(ii) We propose a t-closeness label generalization algorithm TOGGLE that optimizes the cost of subgraph matching. It is proved to have an approximation ratio of \((1+\epsilon )\).

(iii) We develop a partial-graph-based subgraph processing algorithm PGP that requires no query decomposition. It exploits the symmetry of the outsourced graph and limits the search scope to a localized region.

(iv) We conduct an empirical study of our algorithms on all three datasets previously used in the literature and show their high efficiency under various parameter settings.

The remainder of this paper is organized as follows. Section 2 presents the background and problem statement. Section 3 overviews the graph outsourcing and subgraph matching framework with a baseline solution adapted from [3]. In Section 4, we present the label generalization algorithm TOGGLE, and in Section 5, we introduce the PGP subgraph matching algorithm. Experiments and related works are presented in Sections 6 and 7, respectively, followed by a conclusion in Section 8. Formal proofs of lemmas are in Appendix A.

Table 1 List of key notations
Fig. 2 k-automorphic data graph \(G^{k}\), outsourced graph \({\widetilde{G}}^k\) and Label Correspondence Table (LCT)

Fig. 3 Alignment Vertex Table and outsourced query \({\widetilde{Q}}\)

2 Background and problem statement

In this section, we first introduce the background of the two privacy models in (k, t)-privacy, namely t-closeness and k-automorphism. We then present privacy attacks on graphs and summarize them as the threat model of this paper. Finally, we formally define privacy-preserving subgraph matching with (k, t)-privacy. Table 1 lists the key notations and acronyms used in this paper.

2.1 t-Closeness

Generalization, which combines several values into a single one (also known as an equivalence class), has been a popular anonymization technique for labels [3, 22, 23]. The output of a label generalization algorithm is a label correspondence table that records the mapping from each original label to its generalized label. t-closeness [10] is a privacy metric that imposes a constraint on each equivalence class.

Definition 1

(t-closeness) An equivalence class satisfies t-closeness if and only if the label distribution in this class is no more than t distance away from that in the whole set of labels. If all equivalence classes satisfy t-closeness, the label generalization algorithm satisfies t-closeness.

The distance is measured by the Earth Mover's Distance (EMD) [24], which is based on the minimum workload required to transform one distribution into another. Specifically, each distribution is viewed as a mass of earth, and in each step of the transformation some earth is moved from one place to another. The moved mass multiplied by the ground distance of this move is the workload of this step, which is added to the total workload. To find the minimum workload, the EMD computation relies on the well-known combinatorial optimization problem of balanced transportation. Formally, let \({\mathcal {P}}= ({p}(v_1),{p}(v_2),\ldots ,{p}(v_n))\) and \({\mathcal {P}}'= ({p}(v'_{1}),{p}(v'_{2}),\ldots ,{p}(v'_{m}))\) denote the two distributions and \(d_{i,j}\) be the ground distance between \({v}_i\) and \(v'_j\) (Footnote 4); the balanced transportation problem finds a mass flow \(F=\{f_{i,j}\}, i\in [1,n], j\in [1,m]\) such that \(\sum _{i=1}^{n}\sum _{j=1}^{m} d_{i,j} \times f_{i,j}\) is minimized, subject to \( f_{i,j} \ge 0 \) and \( \sum _{i=1}^{n}\sum _{j=1}^{m} f_{i,j} = \sum _{i=1}^{n} p(v_i) = \sum _{j=1}^{m} p(v'_{j}) =1. \) Solving for the mass flow F, we obtain the Earth Mover's Distance between distributions \({\mathcal {P}}\) and \({\mathcal {P}}'\) as \( EMD({\mathcal {P}},{\mathcal {P}}{'}) = \sum _{i=1}^{n}\sum _{j=1}^{m} d_{i,j} \times f_{i,j}. \)

Example 3

Figure 2b shows a label correspondence table. For the attribute "Expected Salary" of graph G in Fig. 1a, the original set of labels is \(l=(7000\), 8000, 14000, 15000). There are two possible label generalizations. The first combines (7000, 8000) as a label group "A" and (14000, 15000) as another group "B." In this case, \(EMD(l,A)\) \(=(0/3+1/3+1/3+2/3)*1/4= 1/3\), where 0/3 is the ground distance \(d_{0,0}\) and 1/4 is the mass \(f_{0,0}\). Similarly, \(EMD(l,B)=1/3\). As such, this method satisfies 1/3-closeness. The second generalization, as illustrated in Fig. 2b, combines (7000, 15000) as "A" and (8000, 14000) as "B." In this case, \(EMD(l,A)\) and \(EMD(l,B)\) are both 1/6. Therefore, this method satisfies 1/6-closeness, which preserves more privacy than the first generalization.
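To make this computation concrete, the following sketch (ours, in Python, not part of the original paper) reproduces the EMD values of Example 3 under the ordered ground distance \(|i-j|/(n-1)\). For one-dimensional distributions over ordered labels, the balanced transportation problem reduces to accumulating the difference of the two cumulative distributions, so no LP solver is needed; the function name emd_ordered and the list-based representation are illustrative.

```python
def emd_ordered(all_labels, group):
    """EMD between the uniform distribution over all n ordered labels and the
    uniform distribution over one label group, with ground distance
    |i - j| / (n - 1) between label positions i and j."""
    n = len(all_labels)
    p = [1.0 / n] * n                                       # mass of each label in l
    q = [1.0 / len(group) if v in group else 0.0 for v in all_labels]
    emd, carry = 0.0, 0.0
    for i in range(n - 1):
        carry += p[i] - q[i]        # earth that must cross the gap between i and i+1
        emd += abs(carry) / (n - 1)
    return emd

labels = [7000, 8000, 14000, 15000]
print(emd_ordered(labels, [7000, 8000]))    # 0.333... = 1/3 (first generalization)
print(emd_ordered(labels, [7000, 15000]))   # 0.166... = 1/6 (second generalization)
```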

2.2 k-Automorphism

k-automorphism is a graph privacy model that can defend against existing structural attacks [17], including the degree attack, 1-neighbor-graph attack, subgraph attack and hub-fingerprint attack [13, 15, 16]. The idea is to construct \((k-1)\) symmetric blocks for each block (i.e., a subset of vertices and their corresponding edges) in a graph, so that a vertex cannot be distinguished from the other \((k-1)\) vertices in those symmetric blocks. The resulting graph is a k-automorphic graph. Converting a graph G into a k-automorphic graph \(G^{k}\) involves three steps—graph partition, graph (block) alignment and edge copy. First, the vertices in G are partitioned into k blocks by the METIS algorithm [25]. Second, graph (block) alignment selects and aligns the vertices with the largest degree in each block and then aligns all other vertices in the same block with those in the other blocks by their breadth-first search (BFS) traversal order. The result is an alignment vertex table (AVT), where each column corresponds to one block and each row denotes the mapping of k vertices. The same mapping can also be recorded by an automorphic function. Third and finally, based on the AVT, symmetric edges are inserted into the other \((k-1)\) blocks for each edge in one block. Crossing edges between two blocks are also copied accordingly.

Example 4

Graph G in Fig. 1a is first partitioned into \(k=2\) blocks, \(B_1\)=\(\{p_1,p_2,s_1,c_1\}\) and \(B_2\)=\(\{p_4,p_3,s_2,c_2\}\) (see Fig. 2a). By graph alignment, each vertex in one block is aligned with a vertex in the other. For example, \(p_1\) is aligned with \(p_4\), so \((p_1,p_4)\) is inserted into the alignment vertex table in Fig. 3a. Equivalently, their mapping is recorded by a k-dimensional automorphic function \(F_1\). After all vertices are aligned, the missing symmetric edges between \(p_3\) and \(p_4\) and between \(p_3\) and \(s_2\) (the dashed lines in Fig. 2a) are inserted. Finally, the crossing edge between \(s_1\) and \(p_3\) is copied between \(p_2\) and \(s_2\). The resulting k-automorphic graph \(G^{k}\) is shown in Fig. 2a.
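To illustrate the edge-copy step of Example 4, the sketch below (ours; the function names and the row-based AVT encoding are assumptions for exposition) applies the automorphic functions derived from the AVT to every edge. As input we use the crossing edge \((s_1,p_3)\) named in Example 4 plus the two \(B_1\) edges whose symmetric copies Example 4 inserts as noise edges.

```python
def build_automorphisms(avt_rows, k):
    """avt_rows: AVT rows, each a tuple of the k aligned vertices (one per block).
    Returns the automorphic functions F_1..F_{k-1}; F_a maps a vertex in block b
    to its aligned vertex in block (b + a) mod k."""
    col = {v: i for row in avt_rows for i, v in enumerate(row)}
    row_of = {v: row for row in avt_rows for v in row}
    return [{v: row_of[v][(col[v] + a) % k] for v in col} for a in range(1, k)]

def copy_symmetric_edges(edges, avt_rows, k):
    """Edge-copy step: for every edge, insert its image under every automorphic
    function; crossing edges between blocks are copied in the same way."""
    out = set(edges)
    for F in build_automorphisms(avt_rows, k):
        for (u, v) in edges:
            out.add(tuple(sorted((F[u], F[v]))))
    return out

# AVT of Example 4 (k = 2), one row per aligned pair of vertices.
avt_rows = [("p1", "p4"), ("p2", "p3"), ("s1", "s2"), ("c1", "c2")]
# The crossing edge (s1, p3) from Example 4, plus the two B1 edges whose
# symmetric copies Example 4 inserts as noise edges.
edges = {("p1", "p2"), ("p2", "s1"), ("p3", "s1")}
print(sorted(copy_symmetric_edges(edges, avt_rows, 2) - edges))
# -> [('p2', 's2'), ('p3', 'p4'), ('p3', 's2')]
```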

2.3 Graph privacy attacks and threat model

In this paper, we study attributed graphs, i.e., undirected graphs with attributes on each node [3].

Definition 2

(Attributed Graph) An attributed graph is defined as \(G=(V,E,T,\Gamma ,L)\) where V is a vertex set; E is an edge set; and \(T,\Gamma ,L\) denote the sets of vertex types, attributes and labels, respectively. Each vertex has a unique type and one or more attributes (depending on the type). Each attribute takes its value from a set of labels.

Example 5

The data graph G in Fig. 1a is an attributed graph with three vertex types, namely “Individual,” “University” and “Company.” “Individual” contains two attributes, namely “Occupation” and “Expected Salary.” The label value of “Expected Salary” of vertex \(p_1\) is “15,000.”

In the literature, privacy attacks on outsourced graphs lead to identity disclosure (structural attacks) and content disclosure (label attacks). Structural attacks include degree attack, 1-neighbor-graph attack, subgraph attack and hub-fingerprint attack [13, 15, 16]. Label attacks include background-knowledge attack, homogeneity attack, skewness attack and similarity attack [9, 10]. A common approach in these attacks is to first identify the target node and then unveil its sensitive labels. The following definition abstracts the capability of all known attacks as above and serves as the threat model of this paper.

Definition 3

(Threat model of graph privacy attacks) An attacker has the complete structure information about the target node, including degree, neighbor list and shortest-distances from known nodes (a.k.a., hubs). She also has the complete label information of all nodes except for the target node. Based on such information, she attempts to match the target node in the outsourced graph and then to unveil its label.

2.4 Problem statement

Our problem of privacy-preserving subgraph matching with (k, t)-privacy involves two subproblems.

Definition 4

(Graph Outsourcing Problem with (k, t)-Privacy) Given a data graph G, compute an outsourced graph for the cloud that satisfies the following two privacy metrics:

(i) Preserving label privacy by enforcing t-closeness for its labels;

(ii) Preserving structure privacy by enforcing k-automorphism for its structure.

Privacy guarantee of \(\mathbf {(k, t)}\)-privacy For any outsourced graph that satisfies (k, t)-privacy, the probability that an attacker correctly re-identifies a target node using any structural information is at most 1/k, because the k-automorphism model ensures that each node is structurally indistinguishable from at least \(k-1\) other nodes. Even in the event (of probability at most 1/k) that the attacker re-identifies the target node, she can only learn limited information about the node's true label, as this label has been generalized into a label group whose distribution of underlying labels is at most t distance away from the distribution of all labels in the graph.

Definition 5

(Subgraph matching problem on outsourced graph) Given a data graph G and its corresponding outsourced graph, for a query Q, to retrieve all subgraphs \(\{g_i\}\) of G, each of which is subgraph isomorphic to Q and vice versa. \(Q=(V_1,E_1,T_1, \Gamma _1,\) \(L_1)\) is subgraph isomorphic to \(g_i=(V_2,E_2,T_2, \Gamma _2,L_2)\) if and only if there exists at least one injection function \(f: Q\rightarrow g_i\) such that

(i) \(\forall u \in V_1, \exists f(u) \in V_2\) such that \(T_1(u) = T_2(f(u)) \) and \(L_1(u) \subseteq L_2(f(u))\);

(ii) \(\forall (u,v) \in E_1, (f(u),f(v))\in E_2\).
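To make the two conditions concrete, the following sketch (ours, assuming a simple dictionary representation of typed, labeled graphs) verifies whether a given candidate mapping f satisfies Definition 5; it is a checking helper, not a matching algorithm.

```python
def satisfies_embedding(query, data, f):
    """Check conditions (i) and (ii) of Definition 5 for a candidate mapping f
    from query vertices to data vertices.  query/data are dicts with per-vertex
    'type' and 'labels' (a set), and an 'edges' set of vertex pairs."""
    if len(set(f.values())) != len(f):                        # f must be injective
        return False
    for u, v in f.items():                                    # condition (i)
        if query["type"][u] != data["type"][v] or not query["labels"][u] <= data["labels"][v]:
            return False
    for (u1, u2) in query["edges"]:                           # condition (ii)
        if (f[u1], f[u2]) not in data["edges"] and (f[u2], f[u1]) not in data["edges"]:
            return False
    return True
```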

Example 6

Figure 1b presents a query Q. A UIUC spin-off company wants to recruit two employees through a professional social network. The search criteria are as follows: (1) they both graduated from UIUC, (2) they work for the same Internet company, and (3) one is an engineer whose expected salary is 15,000/mo, and the other is an art designer whose expected salary is 7,000/mo.

Theorem 1

Both the graph outsourcing problem with (k, t)-privacy and the subgraph matching problem on outsourced graphs are NP-hard.

3 Solution overview

In this section, we overview our solution to graph outsourcing and subgraph matching on outsourced graphs. The workflow of our solution is depicted in Fig. 4. Given a data graph G, the client user anonymizes it to satisfy (k, t)-privacy (step \(\textcircled {\small {1}}\)). The result consists of a k-automorphic graph \(G^{k}\) and a label correspondence table (LCT). Since each vertex in \(G^{k}\) has \((k-1)\) symmetric vertices in the other \((k-1)\) blocks, the client user only needs to upload a succinct version \({\widetilde{G}}^k\) of the k-automorphic graph to the cloud (step \(\textcircled {\small {2}}\)), which is also called an outsourced graph [3]. \({\widetilde{G}}^k\) comprises the vertices and edges in one block together with their 1-hop neighboring vertices and edges. For example, the subgraph inside the red dashed circle in Fig. 2a is the succinct \({\widetilde{G}}^k\).

Upon receiving a matching query \({\widetilde{Q}}\) (step \(\textcircled {\small {4}}\)) transformed from the client's original query Q (e.g., Fig. 3b from Fig. 1b), the cloud evaluates it based on \({\widetilde{G}}^k\) and the LCT. The results are sent back to the client (see step \(\textcircled {\small {5}}\)) for further refinement to filter out false positive results. In the rest of this paper, we focus on the two key components highlighted in Fig. 4, namely graph outsourcing with \(\varvec{(k, t)}\)-privacy and subgraph matching.
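For concreteness, extracting the succinct outsourced graph from \(G^k\) amounts to keeping one block plus its 1-hop neighborhood; the short sketch below is ours, with illustrative names, and is not the authors' implementation.

```python
def extract_outsourced_graph(edges, block):
    """Succinct outsourced graph: the vertices and edges of one block of G^k
    together with their 1-hop neighboring vertices and edges."""
    block = set(block)
    kept_edges = {(u, v) for (u, v) in edges if u in block or v in block}
    kept_vertices = block | {x for e in kept_edges for x in e}
    return kept_vertices, kept_edges

# e.g., keep block B1 = {p1, p2, s1, c1} of the k-automorphic graph in Fig. 2a:
# extract_outsourced_graph(edges_of_Gk, ["p1", "p2", "s1", "c1"])
```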

Fig. 4 Workflow of subgraph matching on outsourced graph

In the rest of this section, we propose a baseline solution by adapting existing work [3] to generalize labels into label groups. Then, we discuss its limitations, based on which we propose two new algorithms for label generalization and subgraph matching in Sections 4 and 5, respectively.

3.1 Baseline solution

3.1.1 Graph outsourcing with (k, t)-privacy

The idea is to handle k-automorphism and t-closeness separately. We first transform the original graph G into a k-automorphic graph \(G^{k}\) and then adapt the existing label generalization algorithm in [3] to satisfy the t-closeness metric. Given a vertex type that contains n vertex labels \(l=(l_{1}, l_{2},\ldots , l_{n})\), we propose Algorithm 1 to generalize the n labels into m groups. It first permutes the set of labels into \(P=(l_{p_1}, l_{p_2},\ldots , l_{p_{n}})\) and then sequentially divides P into m groups, each of which contains n/m labels and satisfies t-closeness.

Algorithm 1 Baseline label generalization (pseudocode)

In particular, Algorithm 1 initializes the global minScore and the permutation \(\{y_j\}\) (Lines 2 and 3) where \(y_j\) is a label group and \( j \in \{1,2,\ldots ,m\}\). Then, the function groupEnum is called recursively to find the optimal permutation by gradually expanding the candidate permutation P (Line 4).

The groupEnum function first checks whether the current P is a complete permutation of size n (Line 6). If so, its cost (e.g., an estimated search space) is calculated by getCost(P) (Line 7). After that, the best result \(\{y_j\}\) is updated to P if the cost is lower than the current lowest (Lines 8 and 9). Otherwise, the function enumerates subsets of n/m labels from l as candidate label groups and records them in the set \(P_s\) (Line 11). For each label group \(y'_j\) in this set, a checkTclo routine calculates the EMD by solving a transportation problem [24] and then checks whether \(y'_j\) satisfies t-closeness (Line 13). If so, \(y'_j\) is appended to the candidate permutation P and removed from the labels l (Line 14).
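A minimal sketch of this enumeration is given below (ours, not the authors' code). It assumes two helper callables, get_cost for the estimated search space of a complete partition and check_tclo for the EMD-based t-closeness test, mirroring getCost and checkTclo in Algorithm 1.

```python
from itertools import combinations

def baseline_generalize(labels, m, get_cost, check_tclo):
    """Baseline enumeration (Algorithm 1): try every way of carving the labels
    into m groups of size n/m, keep only groups passing the t-closeness check,
    and return the partition with the minimum estimated cost."""
    n = len(labels)
    best = {"cost": float("inf"), "groups": None}

    def group_enum(remaining, partition):
        if not remaining:                                   # complete permutation (Line 6)
            cost = get_cost(partition)                      # getCost (Line 7)
            if cost < best["cost"]:                         # Lines 8-9
                best["cost"], best["groups"] = cost, list(partition)
            return
        for group in combinations(sorted(remaining), n // m):    # Line 11
            if check_tclo(labels, group):                   # checkTclo via EMD (Line 13)
                group_enum(remaining - set(group), partition + [list(group)])  # Line 14

    group_enum(set(labels), [])
    return best["groups"], best["cost"]
```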

Complexity Analysis. The space complexity of Algorithm 1 is O(n). As for the time complexity, let \(t_1\) denote the time complexity of checkTclo, which is invoked at most \(\frac{n!}{((n/m)!)^{m}}\) times. As such, the time complexity is \(O(\frac{n!t_1}{((n/m)!)^{m}})\), which is exponential in n.

3.1.2 Subgraph matching

As described in [3], for a subgraph matching query Q at the client, the client first generalizes its vertex labels by the label correspondence table and sends the generalized query \({\widetilde{Q}}\) to the cloud. Since the cloud can only access the succinct graph \({\widetilde{G}}^k\), which consists of one block of \(G^{k}\) together with its 1-hop neighbors, it uses a special star-based subgraph matching algorithm. On the cloud side, the algorithm consists of three steps.

(i) Query decomposition. The cloud first decomposes the query \({\widetilde{Q}}\) (e.g., Fig. 3b) into a set of star shapes \(\{S_i\}\) (see Fig. 3b), each of which has a root vertex together with its adjacent edges and neighboring vertices.

(ii) Star matching. The cloud then retrieves the matchings of each decomposed star \(S_i\) in the succinct graph \({\widetilde{G}}^k\), denoted by \(R(S_i,{\widetilde{G}}^k)\), and leverages the symmetry of \(G^k\) to retrieve the matchings of \(S_i\) in \(G^k\), denoted by \(R(S_i,G^{k})\).

(iii) Star joining. The cloud starts with one \(R(S_i,G^{k})\) and iteratively computes its natural join with \(R(S_j,G^{k})\) until all \(j \ne i\) are joined. The results \(R({\widetilde{Q}},G^{k})\) are the matchings of \({\widetilde{Q}}\) over \(G^{k}\) (a sketch of this natural join is given below).

On the client side, the algorithm filters false positives in \(R({\widetilde{Q}},G^{k})\) using the original graph G and query Q, and obtains the final subgraph matchings R(QG).
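For reference, the star-joining step (iii) above can be sketched as a plain nested-loop natural join. The sketch below is ours and only illustrates the join logic whose cost, quantified in Example 7, motivates PGP.

```python
def join_star_matchings(star_results):
    """Star joining: iteratively natural-join the matching tables of the
    decomposed stars on their shared query vertices.  Each table is a list of
    dicts mapping the query vertices of one star to data vertices."""
    joined = star_results[0]                 # at least one star is assumed
    for table in star_results[1:]:
        merged = []
        for r1 in joined:
            for r2 in table:
                shared = set(r1) & set(r2)
                if all(r1[u] == r2[u] for u in shared):   # consistent on shared vertices
                    merged.append({**r1, **r2})
        joined = merged
    return joined
```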

3.2 Limitations

Though the baseline solution supports privacy-preserving subgraph matching with (k, t)-privacy, it has two main drawbacks. First, the time complexity of label generalization is exponential in the number of labels, as all permutations need to be considered. Second, the star-based subgraph matching is also inefficient because (1) due to query decomposition, it cannot narrow down the search scope of the query to a localized region of the graph, and (2) the natural join is computationally intensive. We illustrate the second drawback with an example.

Example 7

Figure 3b shows a generalized query graph \({\widetilde{Q}}\). The baseline algorithm decomposes \({\widetilde{Q}}\) into two stars, \(S_1\) and \(S_2\). Three matchings are found for \(S_1\) on \({\widetilde{G}}^k\), namely, \((p_1,c_1,s_1,p_2)\), \((p_2,c_1,s_1,p_1)\) and \((p_2,c_1\), \(s_2,p_1)\). Similarly, three matchings are found for \(S_2\) on \({\widetilde{G}}^k\), namely \((c_1,s_1,p_1)\) , \((c_1,s_1,p_2)\) and \((c_1, s_2,p_2)\). As such, the natural join needs to join all six matchings for both stars. Since each matching of \(S_1\) should join with all matchings of \(S_2\), nine join operations are needed.

To overcome the first drawback, in Section 4, we propose the TOGGLE algorithm (T-closeness-Optimized Graph Generalization on Label Extension) for label generalization. It significantly reduces the search space for (sub)-optimal label generalization. For the second drawback, we present the PGP algorithm (Partial-Graph-based subgraph Processing) in Section 5. It exploits the symmetry of the outsourced graph and eliminates the need for query decomposition.

4 TOGGLE for label generalization

As discussed in Section 1, different outputs of label generalization have different impacts on the search space, and a larger search space leads to higher query cost. Hence, TOGGLE aims to generalize labels into groups so as to minimize the search space, which refers to the total number of vertices to explore for a query. To this end, we first study how to estimate the search space of subgraph matching (Section 4.1). We then introduce the optimal TOGGLE for label generalization, which minimizes the search space by formulating the minimization as a combinatorial optimization problem with constraints (i.e., the TOGGLE problem, Section 4.2). We further present an approximate TOGGLE algorithm with a bounded error in Section 4.3.

4.1 Estimating search space for subgraph matching

To estimate the search space for a query \({\widetilde{Q}}\) over an outsourced graph \({\widetilde{G}}^k\), we assume a general expansion-based graph search [3, 26] as follows. The first vertex q of \({\widetilde{Q}}\) is selected based on degree and neighborhood signature. q is then matched on \({\widetilde{G}}^k\) with any vertex that contains the same vertex type and label group as q. After that, other vertices of \({\widetilde{Q}}\) are matched with neighbors of those already matched in \({\widetilde{G}}^k\). For the first vertex q, the number of vertices to explore can be estimated by the number of vertices in \({\widetilde{G}}^k\) multiplied by the probability of these vertices being the same types and having the same labels as q. For other vertices of \({\widetilde{Q}}\), the number of vertices to explore in \({\widetilde{G}}^k\) is limited to neighbors of those already matched in \({\widetilde{G}}^k\). In the end, the search space of \({\widetilde{Q}}\) over \({\widetilde{G}}^k\) is the product of all these numbers. Formally, for the \(\tau \)-th vertex type, let \(n_\tau \), \(m_\tau \) and \(p_{r_\tau (j-1)+i}\) denote the number of labels, the number of label groups and the i-th position in the j-th label group, respectively. \(r_\tau =n_\tau /m_\tau \). We also use \(F_{G}(\tau )\) to denote the probability of vertices in a graph G (e.g., \(G^{k}\), \({\widetilde{Q}}\)) being the type \(\tau \). Similarly, \(F_{G}^{l}(\tau ,i)\) and \(F_{G}^{g}(\tau ,j)\) denote the probability of the i-th label and the j-th label group (after the label generalization) being this type, respectively.

For the first vertex q, the number of its matchings can be estimated by the number of vertices in \({\widetilde{G}}^k\) (approximately \(\frac{|V(G^{k})|}{k}\)) multiplied by the matching probability of q over \({\widetilde{G}}^k\). Formally,

$$\begin{aligned} \frac{|V(G^{k})|}{k}\sum _{\tau =1}^{\mathbb {T}} \Bigg [F_{G^{k}}(\tau ) F_{{\widetilde{Q}}}(\tau ) \sum _{j=1}^{m_\tau } F_{G^{k}}^{g}(\tau ,j) F_{{\widetilde{Q}}}^{g}(\tau ,j) \Bigg ] \end{aligned}$$
(1)

where \(\mathbb {T}\) is the number of vertex types.

For the other vertices, their matches are restricted to neighbors of those already matched in \({\widetilde{G}}^k\). Let \(|{\widetilde{Q}}|\) denote the number of vertices in \({\widetilde{Q}}\), and let the average degree \(D(G^{k})\) approximate the number of neighbors of a vertex in \(G^k\); the number of matches for the other \(|{\widetilde{Q}}|-1\) vertices can then be estimated as follows.

$$\begin{aligned} \Bigg \lbrace D(G^{k}) \sum _{\tau =1}^{\mathbb {T}} \Bigg [F_{G^{k}}(\tau ) F_{{\widetilde{Q}}}(\tau ) \sum _{j=1}^{m_\tau } F_{G^{k}}^{g}(\tau ,j) F_{{\widetilde{Q}}}^{g}(\tau ,j) \Bigg ]\Bigg \rbrace ^{|{\widetilde{Q}}|-1} \end{aligned}$$
(2)

Therefore, the search space of subgraph query \({\widetilde{Q}}\) is the product of Expressions (1) and (2), which directly correlates with \(F_{G^{k}}(\tau ) F_{{\widetilde{Q}}}(\tau )\) \(\sum _{j=1}^{m_\tau } F_{G^{k}}^{g}(\tau ,j) F_{{\widetilde{Q}}}^{g}(\tau ,j)\). Notably, each vertex in \(G^k\) has the union of label groups of all its \((k-1)\) symmetric vertices, which implies that \(F_{G^{k}}^{g}(\tau ,j)\) will increase by a factor of no more than \(k-1\). In other words, assuming that the j-th label group of the \(\tau \)-th vertex type contains \(r_\tau \) labels, \(\{l_{\tau ,p_{r_\tau (j-1)+i}} | i\in [1,r_\tau ]\}\), where \(l_{\tau ,p_{r_\tau (j-1)+i}}\) is the label locating at position i of this label group, \(F_{G^{k}}^{g}(\tau ,j) \le k\sum _{i=1}^{r_\tau }F_{G}^l(\tau , p_{r_\tau (j-1)+i})\) is therefore obtained. Additionally, \(F_{{\widetilde{Q}}}^{g}(\tau ,j) = \sum _{i=1}^{r_\tau }F_{Q}^l(\tau , p_{r_\tau (j-1)+i})\) can be easily derived. \(F_{G^{k}}(\tau ) F_{{\widetilde{Q}}}(\tau )\) \(\sum _{j=1}^{m_\tau } F_{G^{k}}^{g}(\tau ,j) F_{{\widetilde{Q}}}^{g}(\tau ,j)\) is therefore bounded by the product of \(kF_{G^{k}}(\tau ) F_{{\widetilde{Q}}}(\tau )\) and

$$\begin{aligned} \sum _{j=1}^{m_\tau } \Bigg [\sum _{i=1}^{r_\tau }F_{G}^l(\tau , p_{r_\tau (j-1)+i}) \Bigg ]\Bigg [\sum _{i=1}^{r_\tau }F_{Q}^l(\tau , p_{r_\tau (j-1)+i}) \Bigg ]. \end{aligned}$$
(3)

Given a vertex type \(\tau \), \(kF_{G^{k}}(\tau ) F_{{\widetilde{Q}}}(\tau )\) is a constant.
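To make the estimation concrete, the sketch below (ours; the frequency tables are assumed to be precomputed statistics) evaluates the product of Expressions (1) and (2) for a query of q_size vertices, given the per-type and per-label-group frequencies of \(G^{k}\) and \({\widetilde{Q}}\).

```python
def estimate_search_space(num_vertices_gk, avg_degree_gk, k, q_size,
                          type_freq_gk, type_freq_q, group_freq_gk, group_freq_q):
    """Estimated search space = Expression (1) * Expression (2).
    type_freq_*[tau]     ~ F(tau) for G^k and Q~
    group_freq_*[tau][j] ~ F^g(tau, j) for G^k and Q~."""
    match_prob = sum(
        type_freq_gk[tau] * type_freq_q[tau]
        * sum(g * q for g, q in zip(group_freq_gk[tau], group_freq_q[tau]))
        for tau in range(len(type_freq_gk))
    )
    first_vertex = (num_vertices_gk / k) * match_prob                # Expression (1)
    other_vertices = (avg_degree_gk * match_prob) ** (q_size - 1)    # Expression (2)
    return first_vertex * other_vertices
```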

4.2 Optimal TOGGLE for label generalization

Given the vertex labels \(l= (l_{\tau ,1}, l_{\tau ,2},\ldots , l_{\tau ,n_\tau })\) for the \(\tau \)-th vertex type, label generalization combines them into \(m_\tau \) groups, each with \(r_\tau \) labels. It is equivalent to finding a permutation \(P=(l_{\tau ,p_1}, l_{\tau ,p_2}\), \(\ldots \), \(l_{\tau ,p_{n_\tau }})\) of l and then sequentially dividing the permuted labels into \(m_\tau \) groups, each satisfying t-closeness. Formally,

$$ EMD_j \Big (l, \{l_{\tau ,p_{r_\tau (j-1)+i}} { | i\in [1,r_\tau ]}\} \Big ) \le t, \forall j\in [1,m_\tau ]. $$

Meanwhile, we want TOGGLE to minimize Expression (3) so that the label generalization leads to a minimum search space. Combining the above, TOGGLE can be formulated as a combinatorial optimization problem with constraints.

$$\begin{aligned} {{{\underline{\varvec{TOGGLE}}}}}~~~~~~~~\mathop {{{\,\mathrm{argmin}\,}}}_{P} \sum _{j=1}^{m_\tau } \Bigg [\sum _{i=1}^{r_\tau }F_{G}^l(\tau , p_{r_\tau (j-1)+i}) \Bigg ]^2 \end{aligned}$$

subject to

$$ EMD_j \Big (l, \{l_{\tau ,p_{r_\tau (j-1)+i}} { | i\in [1,r_\tau ]}\} \Big ) \le t, j\in [1,m_\tau ]. $$

Note that the objective function above is a simplified version of Expression (3) obtained by assuming that query graphs and data graphs are independent and identically distributed. The vertex type \(\tau \) is omitted whenever it is fixed.
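To tie the formulation back to code, the following sketch (ours) evaluates a candidate partition against the simplified objective and the t-closeness constraints; label_freq plays the role of \(F_{G}^{l}\), and emd can be any EMD routine such as the one sketched in Section 2.1.

```python
def toggle_objective(partition, label_freq):
    """Simplified TOGGLE objective: sum over label groups of the squared sum of
    in-graph label frequencies within the group."""
    return sum(sum(label_freq[l] for l in group) ** 2 for group in partition)

def is_feasible(partition, labels, t, emd):
    """t-closeness constraints: every label group must be within EMD t of the
    overall label distribution."""
    return all(emd(labels, group) <= t for group in partition)
```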

This optimization problem is challenging as there are a huge number of permutations, and for each permutation the Earth Mover's Distance of all \(m_\tau \) label groups needs to be calculated. To address this, we reduce our optimization problem to the General Set Partitioning Problem (GSPP) [21]. Given a universe of n elements (\(1,2,\ldots ,n\)), there is a rule \(\mathbb {Y}\) that generates feasible subsets \(y_j\) of these elements, \(\mathbb {J}=\{y_j\}\). Each subset \(y_j\) has a cost \(c(y_j)\), and GSPP divides all elements into subsets with the minimum total cost. Formally,

Definition 6

(General Set Partitioning Problem) Let \(\lambda _j \in \{0,1\}\) denote whether subset \(y_j\) is a result subset. \(y_{i,j}\)=1 if \(y_j\) contains element i, and 0 otherwise. Then, GSPP finds \({{\,\mathrm{argmin}\,}}\sum _{y_j \in \mathbb {J}}{c(y_j)\lambda _j}\) subject to \(\sum _{y_j \in \mathbb {J}}{y_{i,j}\lambda _j} = 1, i=1,2,\ldots ,n\).

In TOGGLE, the universe has n labels denoted as l. Each permutation P partitions the labels into label groups, and each has \(r\) \(=\) \(n/m\) labels. The j-th label group \(\{l_{p_{r(j-1)+i}} { | i\in [1,r]}\}\) is a subset \(y_j\). The rule \(\mathbb {Y}\) is t-closeness, \(EMD_j(l,y_j)\) \(\le \) t. The cost is the search space, i.e., \(c(y_j)=(\sum _{i=1}^{n} F^l_G(y_{i,j}))^{2}\), where \(F^l_G(y_{i,j})\) is the probability of label i in \(y_j\). Specifically, if \(y_{i,j}=1\), \(F^l_G(y_{i,j}) = F^l_G(i)\), otherwise \(F^l_G(y_{i,j}) =0\).

Therefore, the optimization problem of TOGGLE can be reduced to GSPP as follows:

$$\begin{aligned} {{{\underline{\varvec{TOGGLE}}}}~ under~\mathbf{GSPP} }~~~\min \sum _{y_j \in \mathbb {J}}{\lambda _j\underbrace{c(y_j)}_{(\sum _{i=1}^{n}F^l_G(y_{i,j}))^{2}} } \end{aligned}$$
(4)

subject to \(\sum _{y_j \in \mathbb {J}}{y_{i,j}\lambda _j} = 1, i=1,2,\ldots ,n\), and \(\lambda _j \in \{0,1\}\). The original TOGGLE problem has \(\frac{n! }{((n/m)!)^{m}}\) permutations, whereas the GSPP reduction effectively shrinks the size \(|\mathbb {J}|\) to \(\frac{n! }{(n/m)!(n-n/m)!}\).

4.3 Sub-optimal TOGGLE for label generalization

Since the asymptotic size of GSPP is still exponential in n, finding the optimal partition is only feasible for small n (Footnote 5). In what follows, we propose a sub-optimal solution with a theoretical guarantee. In particular, we first transform the original GSPP optimization problem in Expression (4) into a Linear Programming Master (LPM) problem by relaxing the integer constraint. Then, an initial solution to the LPM problem is obtained, based on which an iterative process consisting of a master problem and a subproblem is carried out. The process terminates when the desired solution is derived.

Algorithm 2 Sub-optimal TOGGLE (pseudocode)

Algorithm 2 outlines the sub-optimal solution. Inspired by Column Generation [21], the algorithm first transforms the original GSPP optimization problem, denoted by \(\mathbf{OP} \), into LPM problem (Line 1) by relaxing the integer constraint \(\lambda _j \in \{0,1\}\) to \(\lambda _j \in [0,1]\).

$$\begin{aligned} {{{\underline{\textit{\textbf{LPM Problem}}}}}}~~~~~~~~~~~~~~~~~~~\min \sum _{y_j \in \mathbb {J}}{c_j \lambda _j} \end{aligned}$$

subject to \(\sum _{y_j \in \mathbb {J}}{y_{i,j}\lambda _j} = 1, i=1,\ldots ,n\) and \(\lambda _j \in [0,1]\), where \(c_j = c(y_j) = (\sum _{i=1}^{n} F^l_G(y_{i,j}))^{2}\) and \(y_j\) is a subset, also called a column.

Since it is impossible to obtain all \(|\mathbb {J}|\) feasible columns at once for the LPM problem, the algorithm first generates \(|\mathbb {J'}|\) (\(\le |\mathbb {J}|\)) feasible columns as an initial solution (Lines 2 to 8). With these columns, a master problem is formulated and solved by the Simplex method (Lines 12 to 13). The optimal solution is then used to formulate a dual subproblem, generate a new feasible column \(y^*_j\) (Lines 14 to 15) and append it to the resulting columns (Line 12). The algorithm then iteratively solves the master problem and the subproblem until there is no column with the desired reduced cost (Line 11). Finally, it applies the branch-and-bound algorithm [27] to obtain the integer solution to the partition of labels (Line 16).

In what follows, we elaborate on these steps and prove the theoretical error bound of this approximation algorithm.

4.3.1 Initial solution (Lines 2 to 8)

Let \(l=(l_1,l_2,\ldots ,l_n)\) be the n labels sorted by their values and \((1/n,1/n,\ldots ,1/n)\) be their distribution masses. Each feasible column \(y_j\) should contain n/m labels \((e_1,\ldots ,e_\alpha ,\ldots ,e_{n/m})\) with evenly distributed mass \((m/n,m/n,\ldots ,\) m/n). Thus, the initial solution should generate m feasible columns "aligned" with l in order to calculate the EMD. Here, "aligned" means that \(e_\alpha \) transports distribution mass to \(l_i\). We further define an Alignment Group as a group of consecutive \(l_i\) in l. If both l and the \(e_\alpha \) are sorted by their values, Lemma 2 gives their alignment relationship.

From Lemma 2, we can derive that if we select an element \(e_\alpha \) for \(y_j\) from one of the positions \([(\alpha -1)m+1,\alpha m]\) of l, the ground distance between \(e_\alpha \) and alignment group \(\alpha \) is at most \(\frac{n^2-n(n/m) }{2(n/m)^2(n-1)}\). Based on this lemma, we propose the following heuristic to obtain the initial solution. When \(t < \frac{m-1}{2(n-1)}\), we design a procedure ModifiedBasSol to find the exact optimal partition that satisfies t-closeness by exhaustive search (Line 7). The procedure is similar to the baseline solution in Algorithm 1, but it uses Lemma 2 to calculate EMD. When \(t \ge \frac{m-1}{2(n-1)}\), we group \(\{l_j\), \(l_{j+m}\), \(\ldots \), \(l_{j+(n/m-1)m}\}\) into a label group \(y'_j\) and further generate \(y_j\) based on \(y'_j\) by setting \(y_{i,j}=1\) if \(l_i\in y'_j\), otherwise \(y_{i,j}=0\) (Lines 2 to 6).
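The round-robin construction of the initial columns (Lines 2 to 6) is simple enough to state directly; the sketch below is ours and assumes the labels are already sorted by value.

```python
def initial_columns(sorted_labels, m):
    """Initial solution when t >= (m-1)/(2(n-1)): the j-th column groups labels
    l_j, l_{j+m}, ..., l_{j+(n/m-1)m} of the value-sorted label list."""
    n = len(sorted_labels)
    return [[sorted_labels[j + alpha * m] for alpha in range(n // m)]
            for j in range(m)]

# initial_columns([7000, 8000, 14000, 15000], 2) -> [[7000, 14000], [8000, 15000]]
```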

4.3.2 Master problem (Lines 12 to 13)

With the generated \(|\mathbb {J'}|\) columns, we further reformulate the LPM problem to a Restricted Linear Programming Master (RLPM) problem and obtain an optimal solution to it and its dual problem (Lines 12 and 13).

$$\begin{aligned} {{{\underline{\varvec{RLPM Problem}}}}}\quad \min \sum _{y_j \in \mathbb {J'}}{c_j \lambda _j} \end{aligned}$$

subject to \(\sum _{y_j \in \mathbb {J'}}{y_{i,j}\lambda _j} = 1, i=1,\ldots ,n\) and \(\lambda _j \in [0,1]\), where \(c_j = c(y_j) = (\sum _{i=1}^{n} F^l_G(y_{i,j}))^{2}\) and \(y_j\) is a column generated from the rule \(\mathbb {Y}\). Note that \(y_j\) is limited to \(\mathbb {J'}\) instead of \(\mathbb {J}\).

The RLPM problem is a simple linear programming problem with only \(|\mathbb {J'}|\) feasible columns, so it can easily be solved by the Simplex method (Line 13).
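As a concrete illustration of Lines 12 and 13, the restricted master problem is an ordinary LP and can be handed to any LP solver. The sketch below is ours and uses SciPy's linprog; reading the dual values from res.eqlin.marginals relies on the HiGHS backend of recent SciPy releases and is stated here as an assumption, not as part of the paper.

```python
import numpy as np
from scipy.optimize import linprog

def solve_rlpm(columns, label_freq):
    """Restricted LP master: minimize sum_j c_j * lambda_j subject to
    Y lambda = 1 and 0 <= lambda_j <= 1.  columns[j] lists the label indices
    covered by column j; c_j = (sum of their in-graph frequencies)^2."""
    n = len(label_freq)
    c = np.array([sum(label_freq[i] for i in col) ** 2 for col in columns])
    Y = np.zeros((n, len(columns)))                      # coverage matrix y_{i,j}
    for j, col in enumerate(columns):
        for i in col:
            Y[i, j] = 1.0
    res = linprog(c, A_eq=Y, b_eq=np.ones(n), bounds=(0, 1), method="highs")
    duals = res.eqlin.marginals                          # mu for the pricing subproblem
    return res.x, duals
```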

4.3.3 Subproblem (Lines 14 to 15)

Let \(\mu \) denote the dual optimal solution to the RLPM problem. The subproblem is to find a new feasible column \(y^*_j\) that minimizes the reduced cost \(\{c(y_j) - \mu y_j\}\) and satisfies the rule \(\mathbb {Y}\), or formally

$$\begin{aligned} y^*_j = \mathop {{{\,\mathrm{argmax}\,}}}\limits _{y_j} \{\mu y_j - c(y_j) \} \end{aligned}$$

subject to \(EMD(l,y_j)\le t\), where \(c(y_j)\) \(=\) \((\sum _{i=1}^{n}F^l_G(y_{i,j}))^{2}\).

The subproblem is obviously NP-hard. We propose a sub-optimal solution by reducing the subproblem to a \(0-1\) Quadratic Knapsack Problem (QKP) (Line 14):

$$\begin{aligned} {{{\underline{\varvec{0 -- 1 QKP}}}}}~~~~~~~~~~~~~~~~~~~\max \{\mu y_j - c(y_j)\} \end{aligned}$$

subject to

$$\begin{aligned} {{{\underline{\varvec{QKP Constraints}}}}}~~~\sum _{\beta =1}^{m}y_{(\alpha -1)m+\beta ,j}=1,~~\alpha =1,\ldots ,\frac{n}{m} \end{aligned}$$

The \(0-1\) QKP can be solved by the CPLEX optimizer [28] as a quadratic programming problem (Line 15).

4.3.4 Analysis on TOGGLE

In this subsection, we give a detailed theoretical analysis for correctness, performance guarantee and complexity.

Correctness analysis We can theoretically prove that the generalized labels generated by TOGGLE satisfy t-closeness. To this end, we need to prove that any generalized label in the initial solution (i.e., \(y_j\)) and that in the subproblem (i.e., \(y^*_j\)) achieve t-closeness.

Theorem 2

When \(t \ge \frac{m-1}{2(n-1)}\), the column \(y_j\) or \(y^*_j\) generated as above satisfies t-closeness, where n and m are the numbers of labels and label groups, respectively.

To sketch this proof, we first introduce two lemmas. In particular, Lemma 1 paves the way for the proof of Lemma 2 and the latter proves the range of the ground distance. Their proofs are given in Appendix A.

Lemma 1

The \(\alpha \)-th element \(e_\alpha \) of \(y_j\) is aligned with \(\alpha \)-th alignment group of l, and \(e_\alpha \) needs to transport 1/n distribution mass to each element in alignment group \(\alpha \).

Lemma 2

The ground distance between \(e_\alpha \) and the alignment group \(\alpha \) (denoted by \(g_\alpha \)) is

$$\begin{aligned} Dist(e_\alpha ,g_\alpha )=\frac{1}{n-1}\sum _{\beta =1}^{m}\Big | i-(\alpha -1)m - \beta \Big |. \end{aligned}$$

For \(\forall i \in [(\alpha -1)m+1, (\alpha -1)m+m ]\), if m is odd, \(Dist(e_\alpha ,g_\alpha ) \in [\frac{n^2 - \frac{n}{m}^2}{4(n-1)\frac{n}{m}^2}\), \(\frac{n^2 - \frac{n^2}{m}}{2(n-1)\frac{n}{m}^2}]\). Otherwise, \(Dist(e_\alpha \), \(g_\alpha ) \in [\frac{n^2}{4(n-1)\frac{n}{m}^2}\), \(\frac{n^2 - \frac{n^2}{m}}{2(n-1)\frac{n}{m}^2}]\).

Performance guarantee We can also prove that the (sub-)optimal TOGGLE has a theoretical guarantee. In particular, if \(t < \frac{m-1}{2(n-1)}\), (sub-)optimal TOGGLE can find the exact solution. Otherwise, it can provide a \((1+m/5)\)-near optimal solution.

Theorem 3

When \(t \ge \frac{m-1}{2(n-1)}\), sub-optimal TOGGLE provides a \((1+\epsilon )\)-approximate solution to the original problem, where \(\epsilon \) is approximately m/5.

To sketch this proof, we introduce two lemmas. The first gives an approximation bound to the initial solution, and the second gives an approximation bound to the subproblem.

Lemma 3

When \(t \ge \frac{m-1}{2(n-1)}\), the initial solution has an approximation ratio \(\big (1+\frac{m^2-m+2}{(m+1)(ln(n)+0.5772+1/(2n))}\big )\) to the optimal solution.

Lemma 4

When \(t \ge \frac{m-1}{2(n-1)}\), the reduction to \(0-1\) QKP provides a \(\frac{4}{(ln(n)+0.5772+1/(2n))m}\)-approximate solution to the original subproblem.

Complexity analysis We can also show that the space complexity is linear in the number of labels n, and that the worst-case time complexity is proportional to \(\frac{n! }{(n/m)!(n-n/m)!}\).

In particular, the space complexity is O(n). As for the time complexity, if \(t \ge \frac{m-1}{2(n-1)}\), let \(\omega \), \(t_1\) and \(t_2\) denote the number of iterations, the time cost of one Simplex invocation and the time cost of one QKP solver call, respectively. Then, the total time complexity of TOGGLE is \(O(\omega (t_1+t_2))\). Otherwise, the time complexity is \(\frac{n!t_3 }{(n/m)!(n-n/m)!}\), where \(t_3\) is the time cost to calculate the EMD using Lemma 2. Note that the time complexity of (sub-)optimal TOGGLE is far less than that of the baseline solution (i.e., \(O(\frac{n!t_1}{((n/m)!)^{m}})\), see Section 3), which is exponential in n. Moreover, the value of \(\frac{n!}{(n/m)!(n-n/m)!}\) is relatively small, as n is the number of labels in a single attribute, which is small in real-world datasets (see Section 6).

5 PGP subgraph matching algorithm

Although subgraph matching has been intensively studied [1, 29, 30], the only algorithm that can work on an outsourced succinct graph \({\widetilde{G}}^k\) is the star-based algorithm [3] described in Section 3. However, this algorithm needs to decompose the original query into multiple sub-queries, which is significantly inefficient. In this section, we propose the partial-graph-based subgraph processing algorithm PGP, which no longer needs query decomposition. In the outsourced graph \({\widetilde{G}}^k\), boundary nodes are the "1-hop neighboring nodes" and interior nodes are the others. The challenge in processing a query on \({\widetilde{G}}^k\) is to retrieve the neighbors of boundary nodes, whose neighbors are only partially present in the succinct graph. To address this problem, PGP exploits the symmetry of the outsourced graph to retrieve matchings for those nodes. Its pseudo-code is shown in Algorithm 3. Note that T and \(T_p\) denote the set of final matched subgraphs and the partial embedding of the matching currently under construction, respectively.

Algorithm 3 PGP subgraph matching (pseudocode)

Algorithm 3 first initializes T and \(T_p\) as empty sets and marks all data nodes in \({\widetilde{G}}^k\) as unvisited (Line 2). Then, matching candidates are generated for each query node of \({\widetilde{Q}}\) by comparing their labels, degrees and neighbors' labels with those of the interior nodes in \({\widetilde{G}}^k\), and expanding them to \(G^k\) (Line 3). Specifically, for each query node u of \({\widetilde{Q}}\), an interior node v is selected as a candidate if all three conditions are met: (1) v has the same vertex type and label group as u, (2) u's degree is no more than that of v, and (3) the labels of u's neighbors are a subset of those of v's neighbors. Next, the algorithm expands the candidates to the other blocks of \(G^k\) and forms the candidate set. After that, it generates a sorted list \(L_q\) that stores the access order of the query nodes of \({\widetilde{Q}}\) according to their selectivity, i.e., the ratio between the number of candidate nodes and the degree (Line 4). Finally, it triggers the subroutine PartialSQ to retrieve all matched subgraphs.

This subroutine first determines whether the partial embedding \(T_p\) is a complete matching by comparing the size of \(T_p\) with that of \({\widetilde{Q}}\). If so, \(T_p\) is inserted into T (Lines 7 to 8). Then, it recursively processes the query (Lines 10 to 17) as follows. It finds the query node u at depth d in \(L_q\) together with its father node \(u_f\) and refines the candidates for matching with u by calling the subroutine refineV (Lines 10 to 11). refineV confines u's candidates v to the neighbors of \(v_f\), the vertex already matched with \(u_f\). Specifically, if \(v_f\) is an interior node, v is selected as a candidate vertex when v is one of the neighbors of \(v_f\). If \(v_f\) is a boundary node, there are three steps: (1) finding the symmetric vertex \(v_f'\) of \(v_f\) among the interior nodes, (2) retrieving all neighbors of \(v_f'\), and (3) determining whether v can be mapped to one of those neighbors by the automorphic function (if so, v is a candidate vertex of u). Then, for each candidate \(v \in C_s\), the subroutine determines whether v can be matched with u in the current partial embedding \(T_p\) by checking whether all vertices paired with u's neighbors in \(T_p\) are also neighbors of v (Lines 12 to 13). If v can be matched, the pair (u, v) is inserted into \(T_p\) (Line 14). Then, PartialSQ is recursively called to access the next vertex of \({\widetilde{Q}}\) (Line 15).
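To convey the control flow of PartialSQ, we give a simplified, self-contained sketch below (ours, not the authors' implementation). It assumes \(k=2\), so the automorphic function is an involution given by the dictionary auto, restricts all candidates to vertices of the succinct graph, and replaces the father-node bookkeeping by simply using the first already-matched query neighbor; the data-structure names (order, q_adj, g_adj, candidates, interior) are illustrative.

```python
def pgp_match(order, q_adj, candidates, g_adj, interior, auto):
    """Simplified PartialSQ backtracking.
    order        : query vertices in selectivity order (Line 4)
    q_adj, g_adj : adjacency dicts of Q~ and of the succinct graph G~^k
    candidates   : candidates[u] = initial candidate set of query vertex u (Line 3)
    interior     : interior nodes of G~^k; auto[v] = symmetric counterpart of v."""
    results, T_p = [], {}

    def data_neighbors(v):
        # Interior nodes keep their full neighbor list; a boundary node borrows
        # the neighbors of its symmetric interior node and maps them back (refineV).
        if v in interior:
            return g_adj[v]
        return {auto[w] for w in g_adj[auto[v]]}

    def partial_sq(depth):
        if depth == len(order):                       # complete matching (Lines 7-8)
            results.append(dict(T_p))
            return
        u = order[depth]
        matched = [T_p[x] for x in q_adj[u] if x in T_p]
        # Restrict u's candidates to neighbors of an already matched vertex (Line 11).
        pool = candidates[u] if not matched else candidates[u] & data_neighbors(matched[0])
        for v in pool:
            if v in T_p.values():                     # keep the embedding injective
                continue
            if all(w in data_neighbors(v) for w in matched):   # Lines 12-13
                T_p[u] = v                            # Line 14
                partial_sq(depth + 1)                 # Line 15
                del T_p[u]

    partial_sq(0)
    return results
```

The key point mirrors Lines 10 to 15 of Algorithm 3: interior nodes expose their stored neighbor lists directly, while a boundary node borrows the neighbor list of its symmetric interior node and maps it back through the automorphic function, so no query decomposition is needed.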

Utility analysis As for utility, two metrics should be considered. One is recall, i.e., the fraction of all true results for the query that are retrieved. Fortunately, the matchings retrieved by PGP cover all true matchings of Q over G, as the (k, t)-privacy model does not drop any edge or node. As such, the recall is always 100%. The other is precision, i.e., the fraction of correct results among the retrieved matchings. The precision of the matchings retrieved by PGP is the same as that of the baseline solution.

Complexity analysis The space complexity is O(|T|) where |T| is the number of matchings. The time complexity is \(O(|C_s|^{|V({\widetilde{Q}})|})\) where \(V({\widetilde{Q}})\) denotes the set of vertices of query \({\widetilde{Q}}\).

6 Experimental results

6.1 Experimental setup

We evaluate the performance of the proposed TOGGLE (denoted by TOG) and PGP algorithms against the baseline solution for label generalization (denoted by BAS) and the star-based subgraph processing algorithm (denoted by SGP) described in Section 3 (code kindly provided by the authors of [3]). To evaluate privacy and utility, we further compare our methods with state-of-the-art techniques. All algorithms are written in C++ except for TOGGLE, which is implemented in MATLAB. The client is a desktop computer with an Intel(R) Core(TM) i5-6600 CPU and 16GB RAM running Windows 10 Pro. The cloud server is a Windows Server 2016 Datacenter machine with 4 CPU cores and 64GB RAM.

Datasets We use three real graph datasets of increasing size: Web-NotreDame, DBpedia and UK-2002. Their statistics are given in Table 2, where \(N_t\), \(N_a\) and \(N_l\) denote the numbers of types, attributes and labels, respectively. In particular, the maximum numbers of labels in a single attribute are 200, 985 and 1000, respectively. Query graphs are generated from the data graph by the random walk scheme [31]. For each dataset, 1000 query graphs with sizes in the range [4, 30] are randomly generated.

Table 2 Datasets
Table 3 Parameter settings for experiments

Performance metrics In each experiment, we measure the Time Cost, Model Cost and Approximation ratio. Time Cost is the clock time to complete the experiment. Model Cost is the value of the objective function (Expression (4)) for a label generalization algorithm. Approximation ratio is the ratio of Model Cost of a sub-optimal solution to that of the optimal solution.

Parameter settings Unless otherwise stated, we use the default parameter values in Table 3. We choose 4 values \(t_i,i\in \{1,2,3,4\}\) for t in order to cover all possible cases of approximation ratio as stated in the proof of Theorem 2.

6.2 Performance of sub-optimal TOGGLE

We first compare the performance of TOGGLE with that of BAS under the default settings. Since BAS is very time-consuming, we use only 12 labels in the default settings. Figure 5a shows the time cost on UK-2002 with 12 labels (denoted by UK-2002(12)). From this figure, we find that the time cost of TOG is far less than that of BAS regardless of t. In addition, we observe a dramatic increase in the time cost of BAS as t increases from \(t_1\) to \(t_4\). This is because the number of label groups satisfying t-closeness increases when t is relaxed from \(t_1\) to \(t_4\). From Fig. 5b, we observe that the model cost of TOG is the same as that of BAS, which indicates that the approximation ratio of TOG on UK-2002(12) can reach 1. Figure 6 shows similar results on the DBpedia(12) dataset. This shows that TOGGLE not only has a theoretical approximation bound, but is also close to optimal in practice. In what follows, we further evaluate its performance by varying the group size and the number of labels.

Fig. 5 Time Cost and Model Cost on UK-2002(12)

Fig. 6 Time Cost and Model Cost on DBpedia(12)

Impacts of group size Figure 7 shows the time cost on UK-2002(12) when varying the group size. In this figure, we use \(TOG_1\) to denote the time cost of TOG when \(t=t_1\); for \(t=t_2\), \(t_3\) or \(t_4\), we use TOG as it behaves the same. Similarly, \(BAS_i\) denotes the time cost of BAS when \(t=t_i\). We observe that the time cost of TOG is far less than that of BAS, especially for \(t=t_3\) and \(t_4\). When the group size increases, the time of BAS decreases because its search space is proportional to the number of different permutations. On the other hand, TOG fluctuates within a narrow range of under 1 second.

Fig. 7 Time Cost vs. Group Size on UK-2002(12)

Fig. 8 Freq on UK-2002(12) and DBpedia(12)

Figure 9 presents the corresponding model costs on UK-2002(12). There is one missing result at \((6,t_1)\) because there is no feasible label group in this setting. It can be seen from this figure that increasing the group size leads to an increase in model cost. Nonetheless, the model cost of TOG is almost identical to that of \(BAS_i\).

Fig. 9 Model Cost vs. Group Size on UK-2002(12)

We also plot the frequency distribution of the approximation ratio (denoted by Freq) for various group sizes and t settings on both datasets in Fig. 8. Overall, even the worst case yields an approximation ratio lower than 1.1, and more than \(50\%\) of the cases achieve a ratio of exactly 1.

Fig. 10 Time Cost vs. Number of Labels on UK-2002

Fig. 11 Freq on UK-2002 and DBpedia

Impacts of label size We vary the number of labels from 10 to 1000 (i.e., the maximum number of labels in a single attribute) on UK-2002 and compare the results. As shown in Fig. 10, the time cost of BAS rises exponentially with the number of labels, and it already becomes prohibitively costly (e.g., more than 1.5 days) for only 16 labels at \(t \ge t_3\). On the other hand, TOG is much less sensitive to the increase in the number of labels. In fact, the time cost of TOG peaks at 400 labels and slowly decreases afterward, which indicates that TOG can easily scale to even more labels. Figure 12 presents the corresponding model costs on UK-2002. It can be seen from this figure that the model cost of TOG is almost equivalent to that of BAS. This shows that TOG achieves robust and desirable performance irrespective of the label size.

Fig. 12 Model Cost vs. Number of Labels on UK-2002

As for the approximation ratio, we plot the frequency distribution of approximation ratios (denoted by Freq) of TOG for both UK-2002 and DBpedia in Fig. 11. We find the approximation ratios are no more than 1.1 and \(80\%\) of them are exactly 1.

To sum up, our privacy-preserving label generalization algorithm TOGGLE achieves high efficiency, effectiveness and scalability.

6.3 Performance of PGP algorithm

We compare our PGP algorithm with SGP, the star-based subgraph processing algorithm. Figure 13a presents the time cost for the Web-NotreDame dataset. We observe that as the query size increases, the time cost of PGP grows much more slowly than that of SGP. This is because the number of star matchings undergoes a huge rise, as shown in Table 4. We then fix the query size to 6 and vary the value of k from 2 to 6. As shown in Fig. 13b, the time cost increases with k, because a larger k introduces more redundant edges into the k-automorphic data graph, which leads to more matchings for a single query. Nonetheless, PGP takes less time than SGP, since the latter incurs significant time computing the natural join of a large number of matched stars. Figure 13c through f shows similar results for the other two datasets. As such, we conclude that PGP always outperforms SGP under various k and query sizes. Furthermore, the gain of PGP becomes more evident for larger k or query sizes.

Table 4 # of Star matchings on Web-NotreDame (k=2)
Fig. 13 Time Cost vs. |V(Q)| and Time Cost vs. k

Fig. 14 Degree frequencies

Fig. 15 Number of matchings (k=6)

Table 5 Success ratio under similarity attack and majority attack

6.4 Performance of privacy–utility

We further compare our work with some existing techniques in terms of privacy and utility on Web-NotreDame.

Power of privacy protection We compare the anonymized graphs produced by our work (TOG) and by other classical anonymization methods [3, 13, 15] under different attacks such as the degree attack, subgraph attack and similarity attack. For each vertex degree d in the anonymized graph produced by our work, we report the number of vertices with degree d. As depicted in Fig. 14, the minimal degree frequency is k, which indicates that our method guarantees privacy under the degree attack. We further test our method under subgraph attacks. To this end, we randomly extract subgraphs from the original graph as query graphs and retrieve the matchings of each query to check whether the number of matchings is at least k. As shown in Fig. 15, the number of matchings of each subgraph query over the anonymized graph produced by our work is at least k, which indicates that the anonymized graph can defend against the subgraph attack. In contrast, Deg [13] and Nei [15] suffer from the subgraph attack, because each of these two algorithms considers only a single type of attack, i.e., the degree attack and the 1-neighbor-graph attack, respectively. We further compare TOG with BAS [3] under the similarity attack and the majority attack (i.e., predicting the target node's label as the majority label of its neighbors). Table 5 presents the success ratios under the similarity attack and the majority attack (denoted by SucRatio(SimAttack) and SucRatio(MajAttack)), i.e., the ratios at which an attacker can infer the sensitive label using the corresponding attack technique. The success ratio of the similarity attack on the anonymized graph produced by TOG is 0, while that on the anonymized graph produced by BAS is close to 50%. SucRatio(MajAttack) of both TOG and BAS is close to 1.0%, which equals the probability of randomly selecting one of the 100 labels as the label of a target node. Therefore, we conclude that TOG guarantees privacy under the degree attack, subgraph attack and similarity attack.
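For clarity, the majority attack used in this comparison can be sketched as follows; this is an illustrative Python snippet under our own data representation, not the exact evaluation code. The attacker predicts a target node's sensitive label as the most frequent label among its neighbors, and the success ratio is the fraction of targets for which this prediction is correct.

```python
from collections import Counter

def majority_attack_success_ratio(adj, labels):
    """adj: node -> iterable of neighbor nodes; labels: node -> sensitive label."""
    targets = [v for v in labels if adj.get(v)]
    hits = 0
    for v in targets:
        # Predict the target's label as the majority label of its neighbors.
        predicted, _ = Counter(labels[u] for u in adj[v]).most_common(1)[0]
        hits += (predicted == labels[v])
    return hits / len(targets) if targets else 0.0
```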

Fig. 16 Degree frequencies on ERNet

Fig. 17 Degree frequencies on SFNet

Fig. 18 Number of matchings on ERNet (k=6)

Fig. 19 Number of matchings on SFNet (k=6)

Table 6 Success ratio under similarity attack and majority attack on ERNet

In addition, we evaluate our methods on synthetic datasets and compare them with existing state-of-the-art techniques. We use the Pajek software (http://vlado.fmf.unilj.si/pub/networks/pajek/) to generate two kinds of random graphs, an Erdos–Renyi graph and a scale-free graph.

1. Erdos–Renyi Graph (denoted by ERNet): This graph is generated by a random graph model that defines a random graph as N vertices connected by M edges chosen randomly from the \(N(N-1)\) possible edges. Pajek generates such a graph given the number of vertices N and the average degree d. In our experiments, we set \(N=1000\) and \(d=10\).

2. Scale-Free Graph (denoted by SFNet): A scale-free network is a network whose degree distribution follows, at least asymptotically, a power law. In our experiments, we set the number of vertices to 1049. (A sketch of generating both synthetic graphs with a general-purpose graph library is given after this list.)
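As an aside, the two synthetic graphs can also be reproduced with a general-purpose graph library. The following sketch uses networkx rather than Pajek; the attachment parameter of the scale-free generator is our own choice, since the paper only reports the vertex count.

```python
import networkx as nx

# ERNet: N = 1000 vertices with average degree d = 10,
# i.e., M = N * d / 2 = 5000 undirected edges chosen at random.
er_net = nx.gnm_random_graph(n=1000, m=5000, seed=42)

# SFNet: 1049 vertices with a power-law (scale-free) degree distribution.
# The number of edges attached per new vertex (2) is an assumed parameter.
sf_net = nx.barabasi_albert_graph(n=1049, m=2, seed=42)
```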

As depicted in Figs. 16 and 17, for each vertex degree d in the anonymized graph produced by our method, the number of vertices with degree d is at least k, which indicates that our method can defend against the degree attack. We further test our method under subgraph attacks by retrieving the matchings of each query. As presented in Figs. 18 and 19, the number of matchings of each subgraph query over the anonymized graph produced by our work is at least k, which indicates that the anonymized graph can defend against the subgraph attack. As before, both Deg [13] and Nei [15] suffer from the subgraph attack. We further compare TOG with BAS [3] under the similarity attack and the majority attack on the synthetic dataset ERNet. Table 6 presents the corresponding success ratios (denoted by SucRatio(SimAttack) and SucRatio(MajAttack)). The success ratio of the similarity attack on the anonymized graph produced by TOG is 0, while that on the anonymized graph produced by BAS is close to 50%. SucRatio(MajAttack) of both TOG and BAS is close to 1.0%, which equals the probability of randomly selecting one of the 100 labels as the label of a target node. Therefore, we conclude that TOG also guarantees privacy under the degree attack, subgraph attack and similarity attack on the synthetic graphs.

Table 7 Running time and storage space (PGP vs. SQ)

Utility evaluation To evaluate the utility of our method (PGP), we further compare it with classical subgraph matching methods over the entire graph. Since both \({\widetilde{G}}^k\) and the Alignment Vertex Table (AVT) are outsourced to the cloud during pre-processing, the cloud effectively knows all the information of the original \(k\)-automorphic graph \(G^k\). Hence, most subgraph matching algorithms [31,32,33] can be extended to work on the outsourced graph \({\widetilde{G}}^k\) by the following steps. (1) Graph extension: given a vertex p of \({\widetilde{G}}^k\), its symmetric vertices P in the other blocks can be found by a lookup in the AVT; the symmetric vertices Q of another vertex q of \({\widetilde{G}}^k\) can be found in the same way, and if there is an edge between p and q, an edge is inserted between the corresponding vertex pairs \(p'\) and \(q'\) with \(p' \in P\) and \(q' \in Q\). (2) Index construction: the indices for degree and neighborhood-signature filtering of candidates, as in previous works such as [26], are then constructed. (3) Subgraph matching: given the extended graph of \({\widetilde{G}}^k\) and the indices, classical subgraph matching algorithms such as [31,32,33] are applied to retrieve the subgraph matchings. The results of PGP and those of the classical subgraph matching algorithm [31] (denoted by SQ) on Web-NotreDame are shown in Table 7. Under the same privacy level (the same k), PGP outperforms SQ w.r.t. both running time and storage space; in particular, the time and space savings are up to tenfold and fivefold, respectively.
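A minimal sketch of the graph-extension step (step 1) is given below. It assumes the AVT is available as a dictionary mapping each vertex to the list of its symmetric copies, one per block, and it pairs the i-th copy of p with the i-th copy of q following the block alignment implied by the k-automorphism; all names and the data representation are illustrative rather than the paper's exact implementation.

```python
def extend_graph(edges, avt):
    """Reconstruct the k-automorphic edge set from the outsourced edges and
    the Alignment Vertex Table (AVT).

    edges: iterable of (p, q) pairs of the outsourced graph.
    avt:   dict mapping a vertex to the list of its symmetric vertices,
           one entry per block (the vertex itself included).
    """
    extended = set()
    for p, q in edges:
        # The i-th symmetric copy of p is connected to the i-th copy of q.
        for p_sym, q_sym in zip(avt.get(p, [p]), avt.get(q, [q])):
            extended.add((min(p_sym, q_sym), max(p_sym, q_sym)))  # undirected
    return extended
```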

Table 8 Recall
Fig. 20 Precision

We further evaluate the recall and precision of PGP, SQ and SGP. Recall that recall is the fraction of correct results among all true results of the query, and precision is the fraction of correct results among the retrieved matchings. As reported in Table 8, recall is always 100% since no edges or nodes are dropped in the generalization process. As for precision, the three methods achieve the same values because each of them can be regarded as subgraph processing on a \(k\)-automorphic graph \(G^k\) (Fig. 20). Moreover, precision decreases as k increases, since a larger k introduces more dummy edges. These observations further justify the utility analysis in Section 5 (Utility Analysis): a larger k yields more stringent privacy, but also more running time and storage space for subgraph processing and lower precision.
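These two metrics can be computed directly from the retrieved and true matching sets; the following tiny sketch, with an assumed set-based representation of matchings, simply makes the definitions precise.

```python
def recall_and_precision(retrieved, truth):
    """retrieved, truth: sets of matchings, e.g. frozensets of
    (query_vertex, data_vertex) pairs."""
    correct = retrieved & truth
    recall = len(correct) / len(truth) if truth else 1.0
    precision = len(correct) / len(retrieved) if retrieved else 1.0
    return recall, precision
```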

Table 9 Summary on related work

7 Related work

The work most germane to this research falls into two categories: privacy-preserving graph data publication and anonymization, and privacy-preserving graph query processing in the cloud; both are summarized in Table 9.

Privacy-preserving graph data publication and anonymization Many structural privacy-preserving mechanisms have been developed [17,18,19] that exploit the symmetry of the graph data to guard against various attacks such as the degree attack, subgraph attack and hub-fingerprint attack [13, 15, 16]. In particular, the k-automorphism method by Zou et al. constructs \((k-1)\) symmetric vertices for each existing vertex and is claimed to defend against any structural attack [17]. When sensitive labels are considered, k-anonymity and \(\ell \)-diversity have been adopted for graphs [3, 12], and these approaches can also preserve structural privacy. However, [12] can only defend against specific structural attacks (e.g., the degree attack) or label attacks on simple labels, and both are vulnerable to advanced attacks such as the similarity attack. Differential privacy (DP) [11], a more stringent privacy model, has been widely studied to protect statistical databases against privacy disclosure; it ensures that query results on a dataset are insensitive to the change of a single record. Owing to its unique strengths, it has been applied to graph data analysis, where the methods can be grouped into two categories: edge-DP-based [34,35,36,37] and node-DP-based [39,40,41,42] methods. In edge-DP (resp. node-DP), two graphs are neighboring if they differ in a single edge (resp. node). In particular, [34] publishes the degree sequence of a graph under edge-DP by adopting the Laplace mechanism, where the sensitivity is 2. [35, 36] also consider publishing the degree sequence or distribution by extending the technique to synthetic graph generation. [37] proposes a framework for graph metric estimation under local differential privacy. When the setting moves to node-DP, the sensitivity of the degree distribution becomes \(2(|V|-1)\), since removing a node may affect the other \(|V|-1\) nodes. Hence, [42] explores a projection approach to reduce the sensitivity when publishing the degree distribution of a graph under node-DP, while [41] adopts a truncation approach. Instead of publishing degree sequences or distributions, differentially private subgraph counting queries under node-DP have also been studied [40, 42]; for example, [42] develops a novel method for differentially private triangle counting in large graphs. Recent efforts have investigated publishing statistics of attributed graphs or spatiotemporal graphs [38, 43, 44]. The neighboring data in [44] are records that differ in the presence of a single edge or in the attribute vector associated with a single node, whereas neighboring data in [43] are defined on infinite series data. Since the perturbation is required to satisfy differential privacy, noise such as Laplace noise [34, 43] needs to be injected into the graphs or their statistics, which makes DP-based techniques insufficient or even infeasible for subgraph matching, where exact matchings are desirable.
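For illustration only, publishing a degree sequence under edge-DP with the Laplace mechanism, in the spirit of [34], can be sketched as follows (function and parameter names are our own): since adding or removing one edge changes two degrees by one each, the L1 sensitivity of the degree sequence is 2, so Laplace noise with scale \(2/\epsilon \) suffices.

```python
import numpy as np

def private_degree_sequence(degrees, epsilon, seed=0):
    """Release a degree sequence under epsilon-edge-DP via the Laplace mechanism."""
    rng = np.random.default_rng(seed)
    # L1 sensitivity of the degree sequence under edge-DP is 2.
    noise = rng.laplace(loc=0.0, scale=2.0 / epsilon, size=len(degrees))
    return np.asarray(degrees, dtype=float) + noise
```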

There are many other graph data anonymization and publication techniques [16, 45,46,47]. For example, to protect a graph from link re-identification, Zheleva & Getoor propose five privacy preservation strategies that vary in the amount of data removed and the amount of privacy preserved [45], while Campan & Truta mask the graph data according to the k-anonymity model, in terms of both node attributes and node-associated structural information [46]. However, these techniques do not apply to our case since (1) some of them cannot protect both structural and label privacy (e.g., [45] only protects the graph from link re-identification), and (2) some of them need to delete edges from the original data graph [45, 46]. Hence, they are infeasible for subgraph matching, where exact matchings are desirable.

Privacy-preserving graph query processing in the cloud Many privacy-preserving methods and frameworks have been developed for diverse graph queries. In particular, Fan et al. [48] propose an asymmetric structure-preserving subgraph query processing method, the first practical private approach for subgraph query services, in which the data graph is publicly known and the query structure/topology is kept secret. [50] develops specific schemes for shortest path queries that achieve much better performance and scalability with a strong privacy property in practical scenarios. [49] proposes a method to efficiently compute the shortest distance in large outsourced graphs without compromising their sensitive information. Ma et al. [51] investigate shortest path sketches by defining a top-k critical vertices query on the shortest path. A novel graph encryption scheme that enables approximate constrained shortest distance querying is studied in [52]. However, all of these are privacy-preserving schemes for (approximate) shortest path queries or top-k critical vertices queries on shortest paths, and they do not apply to our case, where the goal is to find subgraph matchings on a single large attributed graph. [4] presents a method to answer subgraph matching over a set of encrypted small graphs instead of a single large graph. [53] develops a novel k-decomposition algorithm and a new information loss matrix designed for utility measurement in massively large graph datasets; however, it cannot protect label privacy since only structural privacy is considered. The works most germane to this research are [3, 59], which develop privacy-preserving subgraph matching methods on large attributed graphs in the cloud. Unfortunately, they are vulnerable to the similarity attack and suffer from low efficiency. Other techniques target different privacy-preserving graph queries, such as reachability queries [60] and kNN queries [61]; nevertheless, they cannot be adapted to subgraph matching on a single large attributed graph.

8 Conclusion

In this paper, we propose a graph label generalization algorithm and an efficient subgraph matching algorithm in the cloud with t-closeness and k-automorphism privacy. The former achieves a theoretically guaranteed approximation ratio of \((1+\epsilon )\), where \(\epsilon \) is approximately 0.2 times the number of label groups. The latter exploits the symmetry of the generalized graph and limits the search scope to a localized region. As for future work, we plan to extend this solution framework to a wide range of classic graph queries, such as maximal clique search and best region search.