
1 Introduction

In recent years, more and more data have been collected and stored on the Internet. These data are not independent; they carry relationships, as in social networks, communication networks, etc. Many applications benefit from social network information, such as network-based recommendation systems [9] and sybil defenses [10]. However, graph owners hesitate to publish their data because publication may leak confidential information. Graph perturbation has emerged as a natural solution and is receiving growing attention from the research community. The re-identification attack on Netflix [6] is a typical example showing that simply anonymizing a graph is not enough: adversaries can easily obtain the desired information about a particular person if personal privacy is not considered carefully in graph publishing.

Differential privacy [1] is a prominent paradigm for publishing useful statistical information over sensitive data with rigorous privacy guarantees. It has been successfully applied to a wide range of data analysis tasks and has tight relations to other fields such as cryptography, statistics, complexity, combinatorics, mechanism design and optimization. However, it is not easy to apply differential privacy to non-tabular databases such as graphs, where relationships exist among separate entities.

In this paper, we propose a new privacy definition, subgraph-differential privacy, which adapts conventional differential privacy to graphs. We also introduce a mechanism satisfying subgraph-differential privacy. Finally, we evaluate the mechanism on real graphs and show that it provides strong privacy while retaining utility.

The paper is organized as follows. In Sect. 2, we summarize related work. Section 3 reviews the basic concepts of differential privacy. In Sect. 4, we present our proposal, subgraph-differential privacy: its definition and a mechanism. In Sect. 5, we evaluate our scheme experimentally on real graphs. Conclusions and future work are addressed in Sect. 6.

2 Related Work

Several existing works try to apply the robust privacy definition of differential privacy to graph data. The literature on graph data publishing follows two approaches: interactive and non-interactive mechanisms. In the interactive setting, the graph is kept by the data owner and accessed only through a query interface that protects the information inside it. In the non-interactive setting, the graph, or a model of it that is expected to interest most analysts, is released once and for all while still preserving privacy.

From the interactive perspective, [12] designs node differentially private algorithms, that is, algorithms whose output distribution does not change significantly when a node and all its adjacent edges are added to or deleted from the graph. The main idea behind their techniques is to project the input graph onto a set of graphs whose maximum degree is below a certain threshold. Node privacy is easier to achieve in bounded-degree graphs, since the insertion of one node affects only a relatively small part of the graph.

From the non-interactive perspective, F. Ahmed et al. [4] propose a random projection approach which publishes the adjacency matrix of a given graph. It uses random matrix theory to reduce the dimensions of the adjacency matrix and achieves differential privacy by adding a small amount of noise. A. Sala et al. [5] describe a graph by the dK-graph model and introduce the dK-perturbation algorithm, which computes the noise injected into the dK-2 series to obtain differential privacy. This approach becomes more efficient if the dK-series is clustered.

3 Background

Differential privacy ensures that the outcome of any analysis on a database is not influenced substantially by the presence or absence of any individual. An adversary therefore can hardly mount inference attacks on any data row.

Definition 1

(Differential Privacy [1]): A randomized function \( \mathcal {K} \) gives \( \epsilon \)-differential privacy if for all datasets \( D_1 \) and \( D_2 \) differing on at most one element, and all \( S \subseteq Range(\mathcal {K}) \),

$$\begin{aligned} Pr[\mathcal {K}(D_1) \in S] \le \exp (\epsilon ) Pr[\mathcal {K}(D_2) \in S] \end{aligned}$$
(1)

where \(\mathrm {Range}(\mathcal {K})\) denotes the output range of the algorithm \(\mathcal {K}\).

The most popular mechanism achieving \( \epsilon \)-DP adds Laplace noise calibrated to the query answer. The standard deviation of the noise depends on the sensitivity of the function \( \mathcal {K} \).

Theorem 1

(Laplace mechanism [2]): For any function \( \mathcal {K}:D \rightarrow \mathbb {R}^d \), the following mechanism is \( \epsilon \)-DP:

$$\begin{aligned} San_{\mathcal {K}}(\mathbf x ) = \mathcal {K}(\mathbf x ) + (Y_1,\ldots ,Y_d) \end{aligned}$$
(2)

where the \( Y_i \) are drawn i.i.d. from \( Lap(\varDelta \mathcal {K}/\epsilon ) \), and \( \varDelta \mathcal {K} \) denotes the sensitivity of \( \mathcal {K} \).
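For illustration, here is a minimal Python sketch of the Laplace mechanism; the function name and the counting-query example are ours, not from [2]:

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon):
    """Add Laplace noise with scale sensitivity/epsilon to a (possibly
    vector-valued) query answer, as in Theorem 1."""
    scale = sensitivity / epsilon
    noise = np.random.laplace(loc=0.0, scale=scale, size=np.shape(true_answer))
    return true_answer + noise

# Example: a counting query has sensitivity 1.
noisy_count = laplace_mechanism(42, sensitivity=1.0, epsilon=0.5)
```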

4 Subgraph-Differential Privacy

4.1 Problem

Given a graph G representing the data, the graph owner wants to release an anonymized and possibly sanitized graph to commercial partners and academic researchers. Therefore, we take it for granted that attackers will have access to such data.

The structural attack model on graph data is a class of attacks in which adversaries somehow collect a set of vertices and the trust relationships, represented as edges, between those vertices. [3] proposes the k-neighborhood graph attack, in which adversaries possess the subgraph formed by a specific vertex and its neighborhood within d hops. Another attack assumes that an adversary can collect a relatively large graph whose membership partially overlaps with the original graph [7].

In our work, we consider a subgraph attack model similar to the Neighborhood Attack Graph (NAG) in [8]. Assume that an adversary can somehow collect an arbitrary connected subgraph over any set of vertices. Generally speaking, the adversary has only a partial view of a target user/vertex's neighborhood, but he may also know the neighborhoods of the target's neighbors. Our model does not restrict the set of vertices that have relationships with the target vertex.

4.2 Definition

Conventional DP limits an adversary's ability to conclude which neighboring database an output comes from. In the context of tabular datasets, two datasets are considered neighbors if they differ in exactly one tuple. This definition is no longer appropriate in the context of graph data. We therefore define a new notion of neighboring graphs in which the differing entity is a subgraph.

Definition 2

(Subgraph-based neighbor graphs): Given a graph \( G = (V,E) \), let \( V_k \subseteq V \) be a set of k vertices and \( E_k\subseteq E \) be the set of edges between vertices in \( V_k \). \( Neighbor(G,V_k) \) is defined as follows:

$$\begin{aligned} Neighbor(G,V_k) = \{G_d \parallel G_k \mid G_k \in \mathbb {G}^k\} \end{aligned}$$

in which \( G_d = G \setminus E_k \) is G with the edges in \( E_k \) removed, \( \mathbb {G}^k \) is the set of all graphs on \( V_k \), and \( \parallel \) denotes the union of two graphs.

Intuitively, two graphs are neighbors if, given the set of k vertices \( V_k \), they differ in exactly one subgraph (Fig. 1).

Fig. 1. Example of subgraph-based neighbor graphs

Definition 3

(Subgraph-Differential privacy): A randomized function \( \mathcal {K}:\mathbb {G}^n \rightarrow \mathbb {G}^n \) satisfies (\( k,\epsilon \))-subgraph-differential privacy if, given a graph \( G = (V,E) \), for every connected subgraph \( G_k = (V_k,E_k) \), where \( V_k \subseteq V \) is a set of k vertices and \( E_k\subseteq E \) is the set of edges between vertices in \( V_k \), for every pair of graphs \( G_1,G_2 \in Neighbor(G,V_k) \), and for all \( S \subseteq Range(\mathcal {K}) \),

$$\begin{aligned} Pr[\mathcal {K}(G_1) \in S] \le \exp (\epsilon ) Pr[\mathcal {K}(G_2) \in S] \end{aligned}$$
(3)

Subgraph-DP defends against subgraph-based attacks: by observing the perturbed subgraph, adversaries cannot determine with high confidence which original subgraph it comes from.

The privacy parameters \( \epsilon \) and k control how much privacy leaks. The parameter k, introduced specifically for graph data, bounds the size of the protected subgraphs. In fact, the graph owner does not need to configure a large value of k; we suggest that \( k=3 \) is enough.

4.3 Mechanism

Consider a given graph \( G = (V,E) \) as a complete graph: \( E_r = E \) is the set of real edges in G, and \( E_v \) is the set of virtual edges, those which do not exist in G. The underlying idea of our mechanism is very simple: the graph is perturbed by rewiring edges, where rewiring an edge means changing its state from real to virtual or vice versa. A set of edges, both real and virtual, is selected such that every connected subgraph of G with k vertices is perturbed. Each edge is assigned a weight \( w \in [0,1] \) that measures how "important" the edge is: a weight close to 0 means the edge can be rewired without changing the graph features too much. Note that importance here is an abstract notion that we define.

Overall, the mechanism comprises three stages. In stage 1, we select the set of edges into which noise will be injected during perturbation. In stage 2, the parameters the mechanism needs are configured. In the final stage, Laplace noise is generated and added to the weight of each selected edge. The noise shifts the importance of an edge in one of two directions, making it more or less important, and a pre-defined threshold \( \theta \) decides whether an edge is rewired or not. We explain each stage in detail in the rest of this section.

Algorithm 1. The graph perturbation mechanism
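The following is a minimal Python sketch of the perturbation stage as we reconstruct it from the description above; the signature, the `weights` map, and taking the selected edge set \( E_s \) as an input are our assumptions, not the authors' code:

```python
import numpy as np

def perturb(G, E_s, weights, sigma, theta=0.0):
    """Stage 3 of the mechanism: inject Laplace noise into the weight of
    each selected edge and rewire it when the noisy weight falls below
    the threshold theta (fixed to 0 in the paper).

    G is an undirected networkx.Graph; E_s is the output of the selection
    stage; weights maps each selected edge (u, v) to w(u, v) in [0, 1].
    """
    G_prime = G.copy()
    for (u, v) in E_s:
        noise = np.random.laplace(loc=0.0, scale=sigma)
        if weights[(u, v)] + noise <= theta:
            # rewire: flip the edge's state (real <-> virtual)
            if G_prime.has_edge(u, v):
                G_prime.remove_edge(u, v)
            else:
                G_prime.add_edge(u, v)
    return G_prime
```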

Selection Strategy. For every vertex \( v \in G \), we run a Breadth First Search (BFS) on G starting from v. Because we consider subgraphs with exactly k vertices, we only need to traverse the vertices within \( (k-1) \) hops of v. The purpose of the BFS rooted at v is to select edges incident to v and put them in \(E_s\). Ideally, \(E_s\) should be minimal while still guaranteeing that every connected subgraph with k vertices is perturbed; finding such a set, however, is not a trivial task. In our work, we propose a flexible selection strategy which does not give the optimal set but is simple and still satisfies the condition.

For each BFS, selecting all real edges certainly satisfies the condition. But the graph features may change significantly, because a large number of edges would be deleted without any compensation from new edges introduced into the perturbed graph. Therefore, virtual edges should be selected as well, even though real edges alone would suffice. Random selection is the simplest strategy we can consider. Selecting the edges with maximum or minimum weight among the candidates is another strategy; which one to use depends on the feature we want to preserve.

We introduce \( \beta \), the ratio between the number of real edges and the number of virtual edges selected in a particular BFS. Note that the realized ratio is not always exactly \( \beta \), either within each BFS or over the whole selection.

Figure 2 shows an example graph G and its BFS starting from vertex 1. Every vertex in the BFS other than 1 is a candidate endpoint for selection. If the random selection strategy is used and \( \beta = 1 \), two of the virtual edges \( \{(1,3),(1,5),(1,7)\} \) are selected at random; for instance, (1, 3) and (1, 5) are selected. In summary, from this BFS, \( \{(1,2),(1,4),(1,3),(1,5)\} \) are added to \( E_s \).
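A Python sketch of the random selection strategy, under our reconstruction of the candidate rule (all real edges incident to v, plus virtual edges from v to vertices within \( (k-1) \) hops, sampled to match \( \beta \)); the function corresponds to the SelectEdges routine referenced in the proof of Theorem 2:

```python
import random
import networkx as nx

def select_edges(G, k, beta=1.0):
    """Random selection strategy (SelectEdges of Algorithm 1, as we
    reconstruct it). Assumes an undirected graph."""
    E_s = set()
    for v in G.nodes():
        # vertices reachable from v within (k-1) hops
        reach = nx.single_source_shortest_path_length(G, v, cutoff=k - 1)
        real = [(v, u) for u in G.neighbors(v)]
        virtual = [(v, u) for u in reach if u != v and not G.has_edge(v, u)]
        n_virtual = min(len(virtual), round(len(real) / beta))
        E_s.update(frozenset(e) for e in real)            # undirected dedup
        E_s.update(frozenset(e) for e in random.sample(virtual, n_virtual))
    return [tuple(e) for e in E_s]

# With k = 3 and beta = 1 on Fig. 2, the BFS at vertex 1 keeps (1,2), (1,4)
# and samples two of the virtual edges (1,3), (1,5), (1,7).
```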

Fig. 2. An example of the edge selection. (a) The original graph G; (b) the BFS starting from vertex 1

Graph Perturbation. In the final stage, pre-defined random noise is injected into the weight of each \( e \in E_s \). For each edge, if the new weight is smaller than the threshold \( \theta \), the edge is rewired; it remains unchanged if the new weight is larger than \( \theta \). The random noise follows a Laplace distribution with \( \mu = 0 \) and a scale \( \sigma \) that depends on the privacy budget \( \epsilon \).

The injected noise should be large enough to guarantee subgraph-DP, Eq. (3). Theoretically, we can select \( \theta \) anywhere in [0, 1], the range of the weights. However, we fix \( \theta = 0 \) for the following reason. Intuitively, according to the definition of subgraph-DP, when \( \epsilon \) is large the injected noise becomes small, and the graph features change only a little. For large enough \( \epsilon \), \( G^{\prime } \) should be nearly identical to G, meaning that no edges, or only very few, are rewired. If \( \theta \in (0,1] \), then even for large enough \( \epsilon \) a fixed fraction of the edges would still be rewired, which seems quite odd.

Theorem 2

Algorithm 1 satisfies subgraph-DP for a given privacy parameter \( \epsilon \).

Proof

SelectEdges(G,k) in line 2 guarantees that every connected subgraph with k vertices in G is perturbed, as explained above.

Consider a set of k vertices \( V_k \), graphs \( G_1,G_2 \in Neighbor(G,V_k) \), and \( S \subseteq Range(\mathcal {K}) \).

$$\begin{aligned} Pr[\mathcal {K}(G_1) \in S] = \varPi ^{N_k}_{i=1}Pr_{G_1,i} \end{aligned}$$

in which \( N_k = \frac{k(k-1)}{2} \) and \( Pr_{G_1,i} \) is the probability that the \( i^{th} \) edge changes its state in \( G_1 \) to its state in \( G^{\prime } \).

Let \( p_i \) be the probability of rewiring the \( i^{th} \) edge:

$$\begin{aligned} p_i = Pr[w_i + noise \le \theta ] = Pr[Lap(\sigma ) \le \theta - w_i] \end{aligned}$$

where \( \theta = 0 \) and the noise follows the Laplace distribution \( Lap(\sigma ) \) with \( \mu = 0 \).

Basically, \( p_i \) is the Laplace cumulative distribution function evaluated at \( (\theta - w_i) \):

$$\begin{aligned}p_i = \frac{1}{2}\exp (-\frac{w_i}{\sigma }) \end{aligned}$$

For the \( i^{th} \) edge, \( p_i \le \frac{1}{2} \) because \( w_i \in [0,1] \); therefore

$$\begin{aligned}\frac{Pr_{G_1,i}}{Pr_{G_2,i}} \le \frac{1-p_i}{p_i} = \frac{1}{\frac{1}{2}\exp (-\frac{w_i}{\sigma })} - 1 \le \frac{1}{\frac{1}{2}\exp (-\frac{1}{\sigma })} - 1 = \exp (\epsilon _i) \end{aligned}$$
$$\begin{aligned}\frac{Pr[\mathcal {K}(G_1) \in S]}{Pr[\mathcal {K}(G_2) \in S]} = \frac{\varPi ^{N_k}_{i=1}Pr_{G_1,i}}{\varPi ^{N_k}_{i=1}Pr_{G_2,i}} \le \varPi ^{N_k}_{i=1}\exp (\epsilon _i) = \exp (\varSigma ^{N_k}_{i=1}\epsilon _i) = \exp (\epsilon ) \end{aligned}$$
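The proof leaves the calibration of \( \sigma \) implicit. Assuming the privacy budget is split evenly over the \( N_k \) edges, i.e. \( \epsilon _i = \epsilon /N_k \), inverting \( \exp (\epsilon _i) = 2\exp (\frac{1}{\sigma }) - 1 \) gives the noise scale directly:

$$\begin{aligned} \sigma = \left( \ln \frac{\exp (\epsilon /N_k) + 1}{2} \right) ^{-1} \end{aligned}$$

so a larger budget \( \epsilon \) yields a smaller noise scale, matching the intuition used above to justify \( \theta = 0 \).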

Parameter Setting. Let w be the weight of the edge \( (i,j) \), real or virtual, between two vertices i and j; w is computed from a metric m on this edge.

$$\begin{aligned} w(i,j) = -\sigma \log (m(i,j)) \end{aligned}$$
(4)

The metric \( m(i,j) \) measures how strong the connection between i and j is. We introduce the normalized number of mutual friends as the metric:

$$\begin{aligned} m(i,j) = \frac{\#\text {mutual friends}}{2}\left( \frac{1}{deg(i)}+\frac{1}{deg(j)}\right) \end{aligned}$$
(5)

in which deg(i) is the degree of vertex i.

Note that different metrics have different natural ranges, but the range of the weight is fixed to [0, 1]. Therefore, we have to translate the original range of the metric into the range \( [\exp (-\frac{1}{\sigma }),1] \).
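A sketch of the weight computation combining Eqs. (4) and (5); clipping the metric into \( [\exp (-\frac{1}{\sigma }),1] \) is our simple stand-in for the range translation just described:

```python
import math
import networkx as nx

def weight(G, i, j, sigma):
    """Weight of edge (i, j) per Eqs. (4)-(5); assumes deg(i), deg(j) > 0."""
    mutual = len(list(nx.common_neighbors(G, i, j)))
    m = (mutual / 2.0) * (1.0 / G.degree(i) + 1.0 / G.degree(j))
    m = min(1.0, max(m, math.exp(-1.0 / sigma)))  # keep m in [exp(-1/sigma), 1]
    return -sigma * math.log(m)                   # yields w in [0, 1]
```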

We define a target equation which determines the feature we want to preserve in the perturbed graph.

$$\begin{aligned} \sum _{e \in E_s \cap E_r} p_e c_e = \sum _{e \in E_s \cap E_v} p_e c_e \end{aligned}$$
(6)

in which \( p_e \) is the probability of rewiring edge e and \( c_e \) is the cost of rewiring it.

The cost \( c_e \) measures how much an edge impacts a specific graph feature. The target equation states that the expected cost of deleting edges should equal the expected cost of adding new edges. In fact, judiciously appraising the cost \( c_e \) is not trivial in some cases. We present two case studies.

Case 1: Preserving Average Node Degree. The costs of deleting an edge and of adding an edge are clearly the same, so the target equation is rewritten as follows:

$$\begin{aligned} \sum _{e \in E_s \cap E_r} p_e = \sum _{e \in E_s \cap E_v} p_e \end{aligned}$$
(7)

Case 2: Preserving the Number of Triangles. Triangles play an important role in graph analysis. If an edge \( (i,j) \) is deleted or added, the decrease/increase in the number of triangles exactly equals the number of mutual friends of i and j. Therefore, we can use the number of mutual friends as the cost of deleting/adding an edge. This cost does not, in fact, capture triangle counting exactly. However, our experimental results show that the number of mutual friends is accurate enough to preserve triangle counts if the numbers of selected real and virtual edges are approximately the same:

$$\begin{aligned} \sum _{e \in E_s \cap E_r} p_e mu_e = \sum _{e \in E_s \cap E_v} p_e mu_e \end{aligned}$$
(8)

in which \( mu_e \) is the number of mutual friends of the two vertices connected by edge e.
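To check whether a parameter setting balances the target equation, one can evaluate both sides directly; a sketch, using \( p_e = \frac{1}{2}\exp (-w_e/\sigma ) \) from the proof of Theorem 2 (the function name and defaults are ours):

```python
import math

def target_gap(weights_real, weights_virtual, sigma,
               costs_real=None, costs_virtual=None):
    """Evaluate lhs - rhs of the target equation, Eqs. (6)-(8).
    weights_* are the w(e) of the selected real/virtual edges; costs
    default to 1 (Case 1). For Case 2, pass mutual-friend counts."""
    p = lambda w: 0.5 * math.exp(-w / sigma)  # rewiring probability p_e
    costs_real = costs_real or [1.0] * len(weights_real)
    costs_virtual = costs_virtual or [1.0] * len(weights_virtual)
    lhs = sum(p(w) * c for w, c in zip(weights_real, costs_real))
    rhs = sum(p(w) * c for w, c in zip(weights_virtual, costs_virtual))
    return lhs - rhs  # a valid parameter setting drives this close to 0
```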

Unfortunately, it is difficult to satisfy both Eqs. (7) and (8) in practice. To guarantee that the target equation has a solution, a naive method is to rescale the range of the metric to a new range \( [\exp (-\frac{1}{\sigma }),\alpha ] \), a subset of \( [\exp (-\frac{1}{\sigma }),1] \), such that the target equation is satisfied. Solving this condition yields an \( \epsilon _{min} \) such that for \( \epsilon \geqslant \epsilon _{min} \) the target equation has a solution. We do not describe how to compute \( \alpha \) and \( \epsilon _{min} \) in detail here because of space limitations.

5 Evaluation

We collected real graphs to demonstrate that our proposed privacy definition and mechanism work well in practice, guaranteeing privacy while remaining useful for analysis. We run the mechanism in both cases, preserving the average node degree and preserving the triangle count, and then measure the features of both the original and the perturbed graphs to evaluate the differences between them. Note that we do not consider preserving the number of triangles in directed graphs. We implement our mechanism using NetworkX, a Python software package for processing complex networks.

We use data sets that are available at https://snap.stanford.edu/data/. The data sets and their characteristics are described in Table 1.

Table 1. Data sets and their characteristics

5.1 Preserving Average Node Degree

Table 2 shows the average node degree of the perturbed graphs with \( k = 3 \) and \( \beta = 0.5 \). The average node degree is generally preserved in all cases. The main trend is that the average node degree of the perturbed graphs is slightly higher than that of the original graphs.

Table 2. Average node degree of perturbed graphs

Simultaneously, we compute other statistical graph features to verify how our mechanism influences them while preserving the average node degree; specifically, we compute the power law exponent and the clustering coefficient.
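These measurements can be reproduced with NetworkX; a sketch, assuming the third-party `powerlaw` package for the exponent fit (the paper does not name its estimator):

```python
import networkx as nx
import powerlaw  # third-party package for power-law fitting

def graph_features(G):
    """Average node degree, clustering coefficient, and fitted power-law
    exponent of the degree sequence (the features reported in Sect. 5)."""
    degrees = [d for _, d in G.degree()]
    fit = powerlaw.Fit([d for d in degrees if d > 0], verbose=False)
    return {
        "avg_degree": sum(degrees) / len(degrees),
        "clustering_coefficient": nx.average_clustering(G),
        "power_law_exponent": fit.power_law.alpha,
    }
```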

Even though the power law exponent is also preserved in this case, the clustering coefficient changes considerably (Fig. 3). Note that in the case of preserving the average node degree, we expect the numbers of added and deleted edges to be relatively similar. Since the costs of real and virtual edges generally differ, deleting a real edge tends to remove more triangles than a new edge brings in. Therefore the number of triangles decreases dramatically, especially for small \( \epsilon \).

Fig. 3. Power law exponent and clustering coefficient of the perturbed graphs in the case of preserving the average node degree

Table 3. Clustering coefficient of perturbed graphs

5.2 Preserving Triangle Counting

Table 3 shows the changes in the clustering coefficient of the perturbed graphs. All three graphs are preserved with respect to the clustering coefficient. However, all three incur a relatively high \( \epsilon _{min} \).

As in the case of preserving the average node degree, the power law exponent is also preserved here (Fig. 4). All graphs have a slightly higher average node degree because the cost of a selected real edge is generally higher than that of a selected virtual edge: two vertices with a relationship tend to have more mutual friends than two vertices without one. To preserve the number of triangles, we thus have to add more new edges so that the costs of adding and deleting are the same. However, we believe the difference in average node degree is acceptable.

Fig. 4. Power law exponent and average node degree of the perturbed graphs in the case of preserving the number of triangles

6 Conclusions and Future Work

In this study, we propose a novel privacy framework, subgraph-DP, based on the definition of differential privacy. Subgraph-DP is a robust framework for graph data, where entities have relationships with each other, and it defends against subgraph-based attacks. We also propose a mechanism that achieves subgraph-DP and instantiate it for two cases: preserving the average node degree and preserving the number of triangles. The perturbed graph preserves most statistical features of the original. Database owners can adapt subgraph-DP appropriately for their purposes.

However, our work has some limitations. First, \( \epsilon _{min} \) in the case of preserving the number of triangles is high, which means we cannot protect much privacy in that case. Second, measuring how much an edge affects a specific graph feature is a challenge, especially for complex features such as spectral properties and the node degree distribution. Third, our mechanism works well on small and medium graphs, but on large graphs the running time reaches several hours. In future work, we plan to overcome these limitations.