1 Introduction

In recent years, a huge amount of social and human interaction networks have been made publicly available. Embedded within this data, there is user’s private information that must be preserved before releasing the data to third parties and researchers. The study by Ferri et al. [17] reveals that up to \(90\%\) of user groups are concerned by data owners sharing data about them. Backstrom et al. [2] point out that the simple technique of anonymizing graphs by removing the identities of the vertices before publishing the actual graph does not always guarantee privacy. In particular, they have shown that an adversary can infer the identity of the vertices by solving a set of restricted graph isomorphism problems. It is evident that network anonymization processes become an important issue under this scenario.

Several methods have been developed to protect users’ privacy on networks, but none of them has been designed specifically for directed networks. Some methods remove the direction of edges in order to convert directed networks to undirected ones, and then, they utilize undirected algorithms to protect users’ privacy. This has two drawbacks. First, if the published network is undirected, the direction of the edges is lost; hence, in the published version there may be connected nodes that were not connected by a directed path in the original directed graph. Second, if the network is anonymized without considering the direction of the relations, then this information may be used for reidentification, that is the case when considering k-degree anonymization without considering the direction of the edges. However, removing the direction of the edges produces a severe loss of information regarding the structure of the network, in the sense that the in-degree and out-degree of each node are combined into a single characteristic that is anonymized using models designed for undirected networks. There are cases where we are interested to treat the in-degree and out-degree sequences of a graph in a different manner—and not as the combined undirected degree—during the anonymization process. For example, in Twitter’s who-follows-whom social graph, one may be interested to consider different levels of anonymity for the in-degree (followers) and the out-degree (followees) of a user, as the out-degree may contain more sensitive information (e.g., in the case of a celebrity), and is also relevant to consider the direction of the relation (who-follows-whom), since the flow of information goes only in one direction (e.g., a celebrity does not know what his followers post).

1.1 Our contributions

In this paper, we define two k-anonymity models specifically designed for directed networks. Additionally, we present algorithms to implement these models and empirically demonstrate their practical application on real directed networks. Since these graphs have no attributes or labels on the edges, information is contained only in the structure of the graph itself and, due to this, preserving network’s structure and edges’ direction are critical to reduce information loss. The contributions of this work can be summarized as follows:

  • We define two different models for k-anonymity on directed networks, offering different privacy protection levels.

  • We introduce algorithms to achieve the desired privacy levels based on the previously proposed models.

  • We show that our algorithms are able to deal with large networks of thousands and millions of vertices and edges, demonstrating their practical utility in real-world problems.

  • We conduct an empirical evaluation of these models on several real networks, comparing information loss based on different graph properties and also on clustering-specific processes.

  • We demonstrate that our models preserve data privacy, while simultaneously conduct the anonymization process toward reducing information loss and increasing data utility.

1.2 Notation

Let \(G=(V,A)\) be a directed and unlabeled graph (also called digraph), where V is the set of vertices (or nodes) and A the set of arcs (or edges) in G. We define \(n=\vert V \vert \) to denote the number of vertices and \(m=\vert A \vert \) to denote the number of arcs. We use \((v_i,v_j) \in A\) to denote a directed arc from vertex \(v_i\) to \(v_j\) but not vice versa. Finally, we denote by \(G=(V,A)\) and \(\widetilde{G}=(\widetilde{V},\widetilde{A})\) the original and the perturbed graph produced by the anonymization process, respectively.

1.3 Roadmap

This paper is organized as follows. In Sect. 2, we review the related work and the state of the art on privacy-preserving methods for networks. Section 3 introduces the preliminary concepts and our k-anonymity models for directed graphs. Then, in Sect. 4, we propose algorithms to fulfill the privacy levels pointed out by our models.Footnote 1 Our experimental framework is provided in Sect. 5, and then, we discuss the results in terms of information loss and data utility in Sect. 6. Experiments about scalability issues are presented in Sect. 7. Lastly, we discuss the conclusions of this work and future research directions in Sect. 8.

2 Related work

The k-anonymity model was introduced in [35, 36] for privacy preservation on structured or relational data. Formally, the k-anonymity model is defined as follows: let \(RT(A_{1},\ldots ,A_{n})\) be a table and \(QI_{RT}\) be the quasi-identifier associated with it. RT is said to satisfy k-anonymity if and only if each sequence of values in \(RT\left[ QI_{RT}\right] \) appears with at least k occurrences in \(RT\left[ QI_{RT}\right] \). The k-anonymity model indicates that an attacker cannot distinguish between different k records although he manages to find a group of quasi-identifiers. Therefore, the attacker cannot re-identify an individual with a probability greater than \(\frac{1}{k}\).

Several concepts can be used as quasi-identifiers for k-anonymity on graph structured data. A widely applied option is to use the vertex degree as a quasi-identifier. Accordingly, we assume that the attacker knows the degree of some target vertices. If the attacker identifies a single vertex with the same degree in the anonymous graph, then he has re-identified this vertex. That is, \(deg(v_{i}) \ne deg(v_{j}) ~ \forall j \ne i\). This model is called k-degree anonymity [24], and these methods are based on modifying the graph structure (by edge modifications) to ensure that all vertices satisfy k-anonymity for their degree. In other words, the main objective is that all vertices have at least \(k-1\) other vertices sharing the same degree. Furthermore, Liu and Terzi [24] developed a method based on dynamic programming and edge switch in order to construct a new k-degree anonymous graph, where \(V=\widetilde{V}\) and \(E \cap \widetilde{E} \approx E\). Their work inspired many other authors who proposed improved solutions based on different kinds of heuristics, such as [7, 11, 20, 27].

Instead of using the vertex degree, Zhou and Pei [38] consider the 1-neighborhood subgraph of the objective vertices (\(\varGamma (v)\)) as a quasi-identifier. For a vertex \(v_0 \in V\), \(v_0\) is k-anonymous in G if there are at least \(k-1\) other vertices \(v_1, \ldots , v_{k-1} \in V\) such that \(\varGamma (v_0), \varGamma (v_1), \ldots , \varGamma (v_{k-1})\) are isomorphic. They demonstrated that the neighborhood anonymity problem for vertex-labeled graphs is NP-hard. Other authors modeled more complex adversary knowledge and used them as quasi-identifiers. For instance, Hay et al. [22] proposed a method called k-candidate anonymity, where a vertex \(v_0\) is k-candidate anonymous with respect to question Q if there are at least \(k-1\) other vertices in the graph with the same answer. Formally, \(\vert cand_{Q}(v_0) \vert \ge k\) where \(cand_{Q}(v_0) = \{ v_j \in V : Q(v_0) = Q(v_j) \} \). A graph is k-candidate anonymous with respect to question Q if all of its vertices are k-candidate with respect to Q. Zhou et al. [40] and Zhou and Pei [39] considered all structural information about a target vertex as quasi-identifier and proposed a new model called k-automorphism to anonymize a network and ensure privacy against this attack. They define a k-automorphic graph as follows: (a) if there exist \(k-1\) automorphic functions \(F_{a} (a=1, \ldots ,k-1)\) in G, and (b) for each vertex \(v_i\) in G, \(F_{a_1}(v_i) \ne F_{a_2} (1 \le a_1 \ne a_2 \le k-1)\), then G is called a k-automorphic graph.

Rossi et al. [31] studied the problem of k-degree anonymization on time-varying (and multilayer) graphs. Let \(\mathcal {G}=\{G_1, \ldots , G_T\}\) be a time-varying graph with a fixed set of vertices V, where \(|V|=n\). In other words, \(\mathcal {G}\) is defined as a sequence of undirected graphs \(G_t=(V, E_T), ~ t=1, \ldots , T\), where \(E_t\) denotes the set of edges at time t. Also, let \(D=\{d_{it}\}\) be the \(n \times T\) degree matrix, where \(d_{it}\) is the degree of the i-th node of \(G_t\). We say that matrix D is a set of k-anonymous vectors, if for every row \(d_{i:}\) there are at least \(k-1\) vectors \(d_{j:}\) such that \(d_{it}=d_{jt}\), for each \(t=1 \ldots , T\). Then, a time-varying graph \(\mathcal {G}\) is defined to be k-degree anonymous, if the degree D defines a set of k-anonymous vectors. Similar to the work of Liu and Terzi [24], the authors of [31] propose a three-step approach where firstly they enforce anonymity, then enforce realizability, and finally construct the graph. However, their realizability constraints are only for undirected graphs.

All the aforementioned methods work only with simple and undirected graphs, and it is not straightforward to extend those methods to directed networks. The naïve approach to convert digraphs to undirected graphs, anonymize them and finally transform back to directed graphs, causes severe perturbations to the graph’s structure. We will provide an empirical example of such approach in Sect. 6. Other works focus on the problem of edge-weight anonymization, e.g., [14] aims at anonymizing the weights of a graph with the aim of preserving the utility for algorithms such as the minimum spanning tree—thus, emphasizes at preserving the inequalities among the edge weights; Liu et al. [23] protects the weights of the edges by adding Gaussian noise to them. To sum up, those methods preserve the edge weights and not the amount of edge relations; hence, they cannot be adapted to degree anonymity for directed networks. Alternatively, other types of privacy-preserving methods can easily be extended to work with directed networks, such as randomization techniques [18] or class-based generalization techniques [3, 13].

However, [3, 13] consider preventing an attacker from learning interactions between entities, which is equivalent to protecting against edge disclosure in bipartite graphs, that for example may represent users and interactions, or costumers and products. The authors of [18] aim at explicitly preserve the degrees of the nodes while randomizing the graphs. Thus, even when adapted for directed graphs, those approaches may still be vulnerable to attacks based on the degrees.

Therefore, in this paper, we are interested in proposing a k-anonymity model specifically designed for directed networks and also to develop algorithms to protect user’s privacy with guarantees of the k-degree anonymity model.

Related to the complexity of k-degree anonymization algorithms, Hartung et al. [21] proved that the problem of degree anonymity (by only adding edges) is NP-hard on 3-colorable graphs and on graphs with H-index three. Also, they proved that there is a polynomial-time algorithm that transforms any instance of the degree anonymity problem into an equivalent instance with at most \(O(\varDelta ^7)\) nodes. A similar result is obtained in [4] for directed graphs, that is, a polynomial size problem kernel for the combined parameter \((s, \varDelta _D)\), where \(\varDelta _D\) denotes the maximum in- or out-degree of the input digraph D and s is the number of edges to be added. We emphasize that both papers obtain solutions for the original question of Liu and Terzi of obtaining a k-degree anonymous graph that contains the original graph as subgraph, and are equivalent to generating graphs with specified degree sequences and excluded graphs, as in [33]. While we argue that the original edges are to be preserved as much as possible, we are aware that there are many cases where this is not possible. So, we propose an algorithm that tries to obtain k-degree anonymous directed graphs by only adding edges until it is necessary to modify the original graph.

3 k-Anonymity models on directed networks

In this section, we define our models based on k-degree anonymity to preserve user’s privacy on directed networks.

3.1 k-degree anonymity

The concept of k-degree anonymity was proposed by Liu and Terzi in [24] for undirected networks, and it can be directly mapped to the degree sequence.

Definition 1

A vector of integers V is k-anonymous, if every distinct value \(v_{i} \in V\) appears at least k times.

Definition 2

An undirected network \(G=(V,E)\) is k-degree anonymous, if the degree sequence of G is k-anonymous.

Let V and W correspond to the degree sequences of the input and anonymized graph, respectively. The distance between two vectors of integers \(V=[v_1, \ldots , v_n]\) and \(W=[w_1, \ldots , w_n]\) is defined by Eq. 1:

$$\begin{aligned} \varDelta (V,W) = \sum _{i=1}^{n} \vert v_i - w_i \vert , \end{aligned}$$
(1)

where \(v_i \in V\), \(w_i \in W\) and \(\vert V \vert = \vert W \vert = n\). The lower the value of \(\varDelta \), the lower the information loss of the anonymized network.

3.2 k-degree anonymity for directed networks

Direct successors of vertex \(v_i \in V\), denoted by \(\varGamma ^+(v_i)\), are defined as the vertices at distance 1 from \(v_i\), i.e., all \(v_j : (v_i,v_j) \in A\). The number of successors is defined as the vertex’s out-degree, \(d_\mathrm{out}(v_i) = \vert \varGamma ^+(v_i) \vert \). Similarly, direct predecessors of vertex \(v_i\) are all vertices from which \(v_i\) can be reached at one hop. That is, \(\varGamma ^{-1}(v_i)= \{ v_j : (v_j,v_i) \in A \}\) and vertex’s in-degree is defined as \(d_\mathrm{in}(v_i) = \vert \varGamma ^{-1}(v_i) \vert \). Therefore, a directed graph is associated with two degree sequences: the in-degree sequence, \(d_\mathrm{in}=\{d_\mathrm{in}(v_1), \ldots , d_\mathrm{in}(v_n)\}\), and the out-degree sequence, \(d_\mathrm{out}=\{d_\mathrm{out}(v_1), \ldots , d_\mathrm{out}(v_n)\}\). Since each arc connects two vertices, it is obvious that:

$$\begin{aligned} \sum _{i=1}^{n} d_\mathrm{in}(v_i) = \sum _{i=1}^{n} d_\mathrm{out}(v_i). \end{aligned}$$
(2)

It is important to note that, in the anonymization process, the same number of arcs have to be added to both in-degree and out-degree, since each added arc implies adding value one to in-degree and also to out-degree. Thus, anonymous in-degree and out-degree have to satisfy Eq. 2.

Fig. 1
figure 1

Toy example showing an anonymization process from the original graph to an Independent 2-degree, Independent (1, 2)-degree and Paired 5-degree anonymous versions of the same network. a Original network, b Independent 2-anonymous, c Independent (1, 2)-anonymous, d paired 5-degree anonymous

Next, we propose two models to achieve different privacy levels according to the k-anonymity model.

3.2.1 Independent (\(k_i\), \(k_o\))-degree anonymity

This model assumes that an adversary knows the in-degree or the out-degree of some target vertices, but does not knows the in- and out-degree of the target vertices.

Definition 3

A directed network \(G=(V,A)\) is Independent (\(k_i\), \(k_o\))-degree anonymous if the in-degree sequence of G is \(k_{i}\)-anonymous and the out-degree sequence is \(k_{o}\)-anonymous.

In the case that \(k_{i}=k_{o}=k\), we simply call it Independent k-degree anonymity.

Definition 4

A directed network \(G=(V,A)\) is Independent k-degree anonymous if both the in-degree and the out-degree sequences of G are k-anonymous.

Example 1

A toy example of Independent k-degree anonymity is shown in Fig. 1. The original network, shown in Fig. 1a, contains 5 vertices and 6 arcs, and its degree sequences are \(d_\mathrm{in}=\{2, 1, 2, 1, 0\}\) and \(d_\mathrm{out}=\{1, 2, 0, 1, 2\}\). Thus, adding just one arc from \(v_3\) to \(v_5\) is enough to convert this network into a Independent 2-degree anonymous graph. Figure 1b shows the anonymous network, which has \(d_\mathrm{in}=\{2, 1, 2, 1, 1\}\) and \(d_\mathrm{out}=\{1, 2, 1, 1, 2\}\).

Example 2

The graph represented in Fig. 1b is also (2, 2)-anonymous according to our definition. However, using this model we are able to create asymmetric privacy levels if we consider that, for example, the out-degree of some target vertices can be the main knowledge of an adversary and we want to protect our network accordingly. Figure 1c shows an Independent (1, 2)-anonymous version of the graph, where \(d_\mathrm{in}=\{2, 1, 2, 1, 3\}\) and \(d_\mathrm{out}=\{1, 1, 2, 2, 1\}\). Hence, it is possible to re-identify a user using in-degree information but it is not possible using information related to out-degree of some target vertices.

3.2.2 Paired k-degree anonymity

This model assumes that an adversary knows both the in-degree and the out-degree of some target vertices. Obviously, this model gives us a higher privacy protection than the above models, since it also protects users from an adversary who knows only the in- or out-degree of some target vertices. We define the paired degree of a vertex as a pair of integer numbers, where the first one is the in-degree of the vertex and the second one is the out-degree, that is, \((d_\mathrm{in}(v_i), d_\mathrm{out}(v_i))\).

Definition 5

A directed network \(G=(V,A)\) is Paired k-degree anonymous if the paired degree sequence of G is k anonymous, i.e., for each pair (ab) representing the in-degree and the out-degree of a vertex, there exist at least \(k-1\) other pairs with the same values.

Notice that, a Paired k-degree anonymous graph is always an Independent (k, k)-degree anonymous one, but not vice versa. Thus, Paired k-degree anonymity is stronger than Independent (k, k)-degree anonymity.

Example 3

Figure 1d presents the Paired 5-degree anonymous version of our toy example. Four arcs must be added to fulfill the properties of this model, and its degree sequences are \(d_\mathrm{in}=\{2, 2, 2, 2, 2\}\) and \(d_\mathrm{out}=\{2, 2, 2, 2, 2\}\). It is interesting to see that this network is also Independent (5, 5)-degree anonymous. Moreover, the network depicted in Fig. 1c is Independent (2, 2)-anonymous, but it is not Paired 2-anonymous. Actually, it is Paired 1-anonymous.

4 Anonymization of directed graphs

In this section, we present the DGA (directed graph anonymization) algorithm, designed to preserve user’s privacy on directed and unlabeled networks according to the proposed anonymization models. We use the concept of k-degree anonymity to anonymize users’ relationship, performing modifications only on the edge set, so as to generate a new anonymous graph \(G^k=(V,A \cup A^k)\), where \(G^k\) is k-degree anonymous and \(\vert A^k \vert \) is minimized.

Our approach to anonymize a directed graph relies on Definition 2. Thus, we anonymize both the in-degree and the out-degree sequences of \(G=(V,A)\) by edge addition in order to meet the k-degree anonymity for a directed graph. Our approach is based on two steps (similar to the one in [24]):

  1. 1.

    Anonymization of degree sequences. We construct a k-degree anonymous sequence \(d_\mathrm{in}^{k}=\lbrace d_\mathrm{in}^{k}(v_1), \ldots , d_\mathrm{in}^{k}(v_n) \rbrace \) from the in-degree sequence \(d_\mathrm{in}=\lbrace d_\mathrm{in}(v_1), \ldots , d_\mathrm{in}(v_n) \rbrace \) of the original graph using Definition 1. The same process is applied to obtain an anonymized version of the out-degree sequence, \(d_\mathrm{out}^{k}\).

  2. 2.

    Adding fake arcs. The second step adds fake arcs between vertices to meet the anonymized in-degree (\(d_\mathrm{in}^{k}\)) and out-degree (\(d_\mathrm{out}^{k}\)), achieving a k-degree anonymous directed graph \(G^{k}=(V, A \cup A^{k})\), where \(\vert A^{k} \vert \) is minimized.

4.1 Step I: Anonymization of degree sequences

This step provides the anonymity level through the in- and out-degree sequences. Therefore, we develop two different strategies according to the privacy models we have introduced previously. First, we present the algorithm for Independent (\(k_i\), \(k_o\))-degree anonymity, and later we propose a second approach for achieving Paired k-degree anonymity. Last but not least, we detail a post-processing method that needs to be applied when Eq. 2 is not satisfied after the anonymization of degree sequences.

4.1.1 Independent (\(k_i\), \(k_o\))-degree anonymity

We refer to (\(k_i\), \(k_o\))-DGA when Independent (\(k_i\), \(k_o\))-degree is considered. The same process is applied both to in-degree (\(d_\mathrm{in}\)) and out-degree (\(d_\mathrm{out}\)) sequences; therefore, we will detail the process on a general degree sequence (d). The objective of this step is to anonymize the degree sequence of the original network, d. Optimal univariate micro-aggregation by Hansen and Mukherjee [19] is used to achieve the best group distribution for both in-degree and out-degree sequences and then we compute the values for each group that minimize the distance from the original degree sequences by Eq. 1. We choose such algorithm with complexity \(O(k^2n)\) for its flexibility; by changing only one parameter, it can compute the optimal k-anonymous degree sequences for different metrics such as euclidean, linear, or any function of the nodes in the k-groups—contrary to Clarkson et al.’s algorithm [12] that has complexity O(n) but is specifically tailored for taking the maximum on the k-groups. Moreover, we implement it with the improvements proposed by [34], which greatly reduce the execution time. Note that, the degree sequence anonymization is the less expensive part, as shown in Table 5.

Our approach starts by applying a permutation f to the degree sequence to reorder the elements. We refer to the ordered degree sequence as a monotonic, non-decreasing sequence of the vertex’ degrees, that is \(d(v_i) \le d(v_j) ~ \forall i < j\). Let k be an integer such that \(1 \le k < n\) which is the k-degree anonymity value, i.e., \(k_i\) in case of in-degree and \(k_o\) otherwise. Typically, k is much smaller than n. In order to apply the optimal univariate micro-aggregation and according to [19], we construct a new directed network \(H_{k,n}\) and get the optimal partition which is exactly the set of groups that corresponds to the arcs of the shortest path from vertex 0 to vertex n on this graph. We denote by \(g=\{g_1, \ldots , g_p\}\) the optimal partition, where \(\frac{n}{2k-1} \le p \le \frac{n}{k}\), and each of them has between k and \(2k-1\) items. Obviously, each \(d_i \in d\) belongs to a specific group \(g_j \in g\). Since our approach relies only on edge addition to modify the graph structure, we have to increase or keep the same degree values, but not to decrease any of them which would be equivalent to an edge removal. Therefore, the optimal partition corresponds to increasing the value of each vertex’s degree up to the maximum value of its group, i.e., \(d_i = \max (d_q)~\forall d_i,d_q \in g_j\). The cost of the shortest path on \(H_{k,n}\) denotes the number of added arcs that is needed in order to meet the k-anonymity value.

4.1.2 Paired k-degree anonymity

We refer to k-DGA when Paired k-degree is considered. In this model, we need to consider simultaneously both the in- and out-degree of each vertex. Thus, each pair \((d_\mathrm{in}(v_i), d_\mathrm{out}(v_i))\) represents the in-degree and the out-degree of a vertex \(v_i\). According to Definition 5, we must find the optimal partition in this 2-dimensional space. The decision problem of finding a paired k-degree anonymous sequence by adding exactly s edges (referred to as the Numbers Only Digraph Degree Anonymity problem) was proven to be NP-hard (Ref. [5], Theorem 23). Hence, we use multivariate micro-aggregation to find quasi-optimal partitions in a reasonable time; specifically, we have applied the MDAV algorithm [15, 16]. Similarly to the aforementioned method, the optimal partition corresponds to increasing the pair values of each vertex’s degree up to the maximum pair values of its group.

4.1.3 Degree sequences post-processing

It is important to note that the same number of arcs needs to be added to the in-degree and out-degree sequences, since each new arc implies adding value one to both the in-degree and out-degree sequence. Consequently, anonymous in-degree and out-degree sequences have to satisfy Eq. 2.

We denote as \(\eta _\mathrm{in}\) the number of added arcs on the in-degree and by \(\eta _\mathrm{out}\) the number of added arcs on the out-degree sequence, for a given k-degree anonymization of a directed graph G. If \(\eta _\mathrm{in} \ne \eta _\mathrm{out}\), our anonymous degree sequences do not satisfy Eq. 2, which is required for directed graphs. Hence, the minimum number of arcs we must add to the original graph is at least \(\max \{\eta _\mathrm{in},\eta _\mathrm{out}\}\), if we consider that \(\eta _\mathrm{in}, \eta _\mathrm{out}\) are the number of edges needed in an optimal micro-aggregation for the in-/out-degree sequences, respectively. Hence, if we get a k-degree anonymous sequence with \(\max (\eta _\mathrm{in},\eta _\mathrm{out})\) edges, then we know that it is optimal.

Let \(S_\mathrm{in}\) and \(S_\mathrm{out}\) be the optimal in- and out-degree sequence partition obtained after applying the micro-aggregation algorithms, where \(S_\mathrm{in} = \cup _{i=1}^p s_\mathrm{in}^i\) and \(S_\mathrm{out} = \cup _{i=1}^q s_\mathrm{out}^i\). Note that the number of partitions does not have to be equal (\(p \ne q\)). Also, it is important to note that the minimal edge addition to fulfill Eq. 2 is represented by finding the minimal values to solve:

$$\begin{aligned} \sum _{i=1}^p c_\mathrm{in}^i + \alpha _i \times \vert s_\mathrm{in}^i \vert = \sum _{i=1}^q c_\mathrm{out}^i + \beta _i \times \vert s_\mathrm{out}^i \vert , \end{aligned}$$
(3)

where \(c_\mathrm{in}^i\) and \(c_\mathrm{out}^i\) represents the number of added edges at partition i computed by Eq. 4, \(\alpha _i, \beta _i \ge 0\) and \(\alpha _i, \beta _i \in \mathbb {N}\):

$$\begin{aligned} c_\mathrm{in}^i = \sum \vert v_j - \varDelta _i \vert : v_j \in s_\mathrm{in}^i, \end{aligned}$$
(4)

where \(\varDelta _i = \max \{deg(v_j) : v_j \in s_\mathrm{in}^i\}\). In order to simplify the equation and the calculations, we consider only the different sizes of \(s_\mathrm{in}^i\) and \(s_\mathrm{out}^i\), which are denoted by \(a_i\) and \(b_i\), respectively. We will denote \(\sum _{i=1}^p c_\mathrm{in}^i - \sum _{i=1}^q c_\mathrm{out}^i\) as R. Then, we can obtain the following equation from Eq. 3:

$$\begin{aligned} \sum _{i=1}^{p'} \alpha _ia_i + R = \sum _{i=1}^{q'} \beta _ib_i, \end{aligned}$$
(5)

where \(p' < p\) and \(q' < q\), since we are taking out the repeated values of \(\vert s_\mathrm{in}^i \vert \) and \(\vert s_\mathrm{out}^i \vert \). For the same reason, the values of \(\alpha _i,\beta _i\) in Eq. 5 are different from the values in Eq. 3.

Recall that in optimal micro-aggregation, \(k\le \vert s_\mathrm{in}^i \vert , \vert s_\mathrm{out}^i \vert \le 2k-1\) for all \(i\le \max (p,q)\). Hence, \(k\le a_i, b_i\le 2k-1\) for all \(i\le \max \{p',q'\}\). If we assume that \(\beta _{i_0}\not =0\) for a given \({i_0}\), then we obtain the equation:

$$\begin{aligned} \sum _{i=1}^{p'} \alpha _ia_i + R - \sum _{i\not ={i_0}} \beta _ib_i= \beta _{i_0}b_{i_0} \end{aligned}$$
(6)

Therefore, a solution can be obtained by solving the following equation:

$$\begin{aligned} \sum _{i=1}^{p'} \alpha _ia_i + R - \sum _{i\not ={i_0}} \beta _ib_i \equiv 0\ (\text {mod }b_{i_0}) \end{aligned}$$
(7)

Now, since we are working with congruences (mod \(b_{i_0}\)) we can consider the coefficients \(\alpha _i,\beta _i\) to be less than \(b_{i_0}\), which gives a large reduction of our search space for the solutions. In the worst case, we can obtain a solution by brute force, considering all the combinations of \(\alpha _i,\beta _i \le b_{i_0}\) which would be a search in \(O(k^{k})\), since \(b_{i_0} \le 2k\). Moreover, in practice we can find solutions to Eq. 7 much faster. In all the sequences, we have studied, \(\alpha _i=0\) for all i. While, for some \({i_1}\not ={i_0}\) and \(\beta _{i_1}\) the congruence \(R - \beta _{i_1}b_{i_1} \equiv 0\ (\text {mod}\,b_{i_0})\) was verified, so it is enough in most cases to consider only one variable \({i_1}\not ={i_0}\).

4.2 Step II: Graph modification

As mentioned earlier, our algorithm is based on adding fake arcs. Other methods anonymize the graph’s structure by adding and removing arcs, instead of additions only. In our approach, we consider keeping arcs of the original network, since true relations between users can be important for clustering or other graph-mining tasks. The authors of [9] empirically proved that edge addition is the best method to keep graph’s properties when perturbing scale-free networks, which constitute the most common type of real-world networks.

Fig. 2
figure 2

Illustration of edge addition, switch and extension processes. Solid lines represent new edges to be added and dashed lines existing edges to be deleted. Vertex color indicates whether a vertex changes its degree (dark gray) or not (light gray) after the edge modification has been carried out. a Edge addition, b edge switch, c edge extension

Once we have computed the k-degree anonymous in-degree and out-degree sequences, our approach computes the vector of differences between the original and anonymous sequences. That is, \(\delta _\mathrm{in} = d_\mathrm{in}^k - d_\mathrm{in}\) and \(\delta _\mathrm{out} = d_\mathrm{out}^k - d_\mathrm{out}\). Each vector clearly shows which vertices have to increase their in-degree (\(\delta _\mathrm{in}\)) and out-degree (\(\delta _\mathrm{out}\)). For each of them, we use three edge modification processes to increase the in- and out-degree of vertices in \(\delta _\mathrm{in}\) and \(\delta _\mathrm{out}\), respectively, which are the following:

  1. 1.

    Edge addition randomly chooses a combination of vertices which satisfies \((v_i,v_j) \not \in A\), where \(v_i \in \delta _\mathrm{out} : \delta _\mathrm{out}(v_i)>0\) and \(v_j \in \delta _\mathrm{in} : \delta _\mathrm{in}(v_j)>0\). The out-degree of vertex \(v_i\) and the in-degree of \(v_j\) both increase, as shown in Fig. 2a.

  2. 2.

    Edge switch occurs between four vertices \(v_i, v_j, v_k, v_p \in V\) where \((v_i, v_j),(v_k, v_p) \in A\) and \((v_i, v_p),(v_k, v_j) \not \in A\). It is defined by deleting arc \((v_k, v_p)\) and adding new arcs \((v_i, v_p)\) and \((v_k, v_j)\), as Fig. 2b illustrates. Note that, the out-degree of vertex \(v_i\) and the in-degree of vertex \(v_j\) will increase by 1, while other vertices’ degree will remain the same.

  3. 3.

    Edge extension exists between three vertices \(v_i, v_k, v_p \in V\), where \((v_k, v_p) \in A\) and \((v_k, v_i),(v_i, v_p) \not \in A\). Arc \((v_k, v_p)\) is deleted and new arcs \((v_k, v_i)\) and \((v_i, v_p)\) are created, as Fig. 2c illustrates. Note that the in- and out-degree of vertex \(v_i\) increases, while auxiliary vertices’ degree remains the same.

The process is described in Algorithm 1. For each vertex \(v_i \in \delta _\mathrm{out}\), the algorithm finds \(v_j \in \delta _\mathrm{in}\) and adds an arc between them. Due to the edge sparsity of real networks, this process is possible in several cases. However, in some cases it is not possible to create a fake edge as described previously. Then, we propose to use edge switch or edge extension to alter graph’s structure to fulfill the anonymous degree sequences. It may be the case that the order of adding edges may be relevant, as in Fig. 3. Suppose that the anonymous sequence should be \(\delta _\mathrm{out} = (2,2,3)\) and \(\delta _\mathrm{in} = (2,2,3)\), and the algorithm added the edges \((v_1, v_4)\) and \((v_2, v_5)\) first, as in (b). In this case, it will not be possible to apply any of our three edge modification processes to add one to the degrees of nodes \(v_3\) and \(v_6\); however, by adding the correct edges the sequence could have been obtained with our edge modification processes, as in (c). Notice that, we have never encountered such situation in our experiments possibly because of the sparsity of social networks, and due to the fact that our algorithms choose the added edges at random—arriving at such situation that no possible edge can be added, will only require to rerun the algorithm to avoid it.

Fig. 3
figure 3

Order for adding edges may be relevant. a Original graph, b adding \((v_1, v_4)\) and \((v_2, v_5)\), c adding \((v_1, v_6)\), \((v_2, v_4)\) and \((v_3, v_5)\)

figure e

5 Experimental framework

In this section, we will describe the experimental framework we have used to analyze and compare the information loss induced by our anonymization methods. For each dataset, we compute the Paired and Independent k-degree anonymous networks considering different values of k in the range [1, 10]. Notice that, \(k=1\) corresponds to the original network. Independent \((k_i,k_o)\)-degree anonymous networks are evaluated in the range of \(k_{i} \in [1,10]\) and \(k_{o} \in [1,10]\); this implies a total of 100 anonymous networks for each dataset.

5.1 Description of network datasets

We have used five standard and well-known real networks to test our methods: (1) Polblogs [1], a network of hyperlinks between weblogs on US politics; (2) UC-Irvine [28], which contains messages sent between the users of an online community of students from the University of California, Irvine; (3) Wiki-vote [26] (Wikipedia vote network) contains all the Wikipedia voting data from the inception of Wikipedia till January 2008, where vertices in the network represent wikipedia users and a directed edge from node \(v_i\) to node \(v_j\) represents that user i voted on user j; (4) DBLP-cite [25] is the citation network of DBLP, a database of scientific publications such as papers and books, where each vertex is a publication and each edge represents a citation of a publication by another publication; and (5) Epinions [30] is a who-trust-whom online social network of a general consumer review site Epinions.com, where members of the site can decide whether to “trust” each other. We have selected these datasets because they have diverse statistics and properties, as shown in Table 1. We have removed loops and multiple edges from all analyzed networks.

Table 1 Datasets used in this study

5.2 Information loss evaluation

In this part, we describe the criteria that are used to quantify the information loss that is introduced by our anonymization models. Following the approach presented in [8], we use diverse structural measures which are strongly or moderately correlated with clustering-specific processes. We claim that, by choosing those measures, our results will be applicable not only to graph’s properties but also to clustering and community detection processes. The first graph structural measure is the average distance (\(\overline{dist}\)), which is defined as the average of the distances between each pair of vertices in the graph. Diameter (d) is defined as the largest minimum distance between two vertices in the graph, and edge intersection is the percentage of original arcs which are also present in the perturbed version of the graph, i.e., \(EI(G,\widetilde{G}) = \frac{\vert A \cap \widetilde{A} \vert }{\max (\vert A \vert ,\vert \widetilde{A} \vert )}\). The above measures evaluate the entire graph as a unique score. We compute the error on these graph metrics as follows:

$$\begin{aligned} \epsilon _{m}(G,\widetilde{G}) = \vert m(G) - m(\widetilde{G}) \vert \end{aligned}$$
(8)

where m is one of the graph metrics defined above, G is the original graph, and \(\widetilde{G}\) is the k-anonymous graph.

The following metrics evaluate specific structural properties for each vertex of the graph: the first one is betweenness centrality (\(C_B\)), which measures the fraction of the shortest paths that go through each vertex. The second one is closeness centrality (\(C_C\)), and it measures how many steps are required to access every other vertex from a given vertex. We refer to \(C_C^-\) when the in-degree is considered and \(C_C^+\) in case of considering the out-degree. Finally, we use the degree centrality (\(C_D\)), which evaluates the centrality of each vertex based on its degree, i.e., the fraction of vertices connected to it. Similarly, \(C_D^-\) refers to in-degree and \(C_D^+\) to the out-degree of each vertex. We compute the error on vertex metrics by:

$$\begin{aligned} \epsilon _{m}(G,\widetilde{G}) = \sqrt{\frac{1}{n} \big ( (g_{1} - \widetilde{g}_{1})^{2} + \cdots + (g_{n} - \widetilde{g}_{n})^{2} \big )}, \end{aligned}$$
(9)

where \(g_{i}\) and \(\widetilde{g}_{i}\) are the values of the metric m for vertex \(v_i\) of G and \(v_i\) of \(\widetilde{G}\), respectively.

5.3 Clustering-specific evaluation

Variations in the generic graph properties are a good way to assess the information loss, but they have their limitations because they are just a proxy for the changes in data utility we actually want to measure. We define the specific information loss measures as a task-specific measure for quantifying the data utility and the information loss associated to a data publishing process. We focus on clustering-specific processes, due to their importance in networks arising from diverse applications, including social, biological and healthcare networks. Similar to generic graph measures, we compare the results obtained both by the original and the perturbed data in order to quantify the level of noise introduced in the perturbed data. This measure is specific and application-dependent, but it is necessary to test the perturbed data in real graph-mining processes.

Fig. 4
figure 4

Framework for evaluating the clustering-specific information loss measure

We consider the following approach to measure the clustering assessment for a particular perturbation and clustering method: (1) apply our k-degree anonymity algorithms to the original graph G and obtain \(\widetilde{G}\); (2) apply a particular clustering method c to G and obtain clusters c(G) and then apply the same method to \(\widetilde{G}\) to obtain \(c(\widetilde{G})\); (3) compare the clusters c(G) to \(c(\widetilde{G})\) as shown in Fig. 4. With respect to information loss, it is clear that the more similar \(c(\widetilde{G})\) is to c(G), we have the less information loss. Thus, clustering-specific information loss measures should evaluate the divergence between both sets of clusters c(G) and \(c(\widetilde{G})\).

Ideally, the results should be the same, that is, the same number of sets (i.e., clusters) with the same elements in each set. In this case, we can say that the anonymization process has not affected the clustering process. When the sets do not match, we should be able to calculate a measure of divergence. For this purpose, we use the precision index [6]. Assuming that we know the true communities of a graph, the precision index can directly be used to evaluate the similarity between two cluster assignments. Given a graph of n nodes and q true communities, we assign to nodes the same labels \(l_{tc}(\cdot )\) as the community they belong to. In our case, the true communities are the ones assigned on the original dataset (i.e., c(G)), since we want to obtain communities as close as the ones we would get on non-anonymized data—we are not interested in the ground truth communities. Assuming that the perturbed graph has been divided into clusters (i.e., \(c(\widetilde{G})\)), then for every cluster, we examine all the nodes within it and assign to them as predicted label \(l_{pc}(\cdot )\) the most frequent true label in that cluster (basically the mode). Then, the precision index can be defined as follows:

$$\begin{aligned} precision(G,\widetilde{G}) = \frac{1}{n} \sum _{i=1}^{n} \delta _{l_{tc}(v_i), l_{pc}(v_i)}, \end{aligned}$$
(10)

where \(\delta \) is the Kronecker delta function, i.e., \(\delta _{x,y}\) equals 1 if \(x=y\) and 0 otherwise. Note that the precision index is a value in the range [0, 1], which takes the value 0 when there is no overlap between the sets and the value 1 when the overlap between the sets is complete.

We have considered two different graph clustering algorithms to evaluate the anonymization process. Both them are unsupervised algorithms based on different concepts and developed for different applications and scopes. The selected clustering algorithms are: (1) Infomap [32] that optimizes the map equation, which exploits the information-theoretic duality between the problem of data compression and the one of detecting significant structures in the graph; and (2) Walktrap [29] that finds densely connected subgraphs via random walks.

6 Information loss and data utility

In this section, we present the results of our anonymization techniques in terms of data utility and information loss. Generic information loss measures, which are based on several graph’s properties, will be described in the next section, while information loss regarding clustering-specific graph-mining tasks will be analyzed in subsequent section.

6.1 Generic information loss evaluation

In this subsection, we compare the results of anonymizing several networks using our models and algorithms for k-degree anonymity on directed networks. Specifically, we will use DGA for Paired and Independent k-degree anonymity. We apply both algorithms on the same data with the same parameters and compare the results in terms of information loss and data utility. It is important to note that the privacy level achieved for both algorithms is similar, but not the same. As we have discussed previously, Paired k-degree anonymity is a stronger model than the one of Independent k-degree anonymity. However, the former or the latter method could be of interest depending on the dataset and the application scenario. Unfortunately, we cannot compare our methods to other k-degree anonymity algorithms, due to the fact that our work is the first that considers k-degree anonymity models specifically designed for directed networks.

Table 2 Results for Paired k (P-k) and Independent k (I-k) degree anonymity over 10 levels of anonymization on DBLP-cite dataset

Firstly, an in-depth analysis of generic information loss measures on DBLP-cite network will be performed. We propose a detailed study of how generic information loss measures evolves during the anonymization process. Then, we present the same analysis on the other four networks, but skipping details due to the space constraints.

The results are shown in Table 2. Each row indicates the scored value for the corresponding measure and method, and each column corresponds to an experiment with a different k-anonymity value. For each dataset and method, we vary k from 1 to 10 (k=1 corresponds to the original dataset) and compare the results obtained on EI, EA, \(\overline{dist}\), d, \(C_B\), \(C_C^{-}\), \(C_C^{+}\), \(C_D^{-}\) and \(C_D^{+}\). The last column corresponds to the average error \(\overline{\epsilon _{m}}\). Each characteristic is reported two times, corresponding to Paired and Independent k-degree anonymity. Clearly, the lower the information loss, the better the method. Perfect performance in a row would be indicated by achieving exactly the same score as in the original network (the k=1 column). Although deviation is undesirable, it is inevitable due to the edge modification process. Complementary information is introduced in Fig. 5, where we can see graphical details about the behavior of different models during anonymization process (those are the same values that we have reported in Table 2). Additionally, information regarding degree distribution is provided in Fig. 5a, where several nodes do not fulfill k-degree anonymity both on in- and out-degree.

Fig. 5
figure 5

Generic and clustering information loss evaluation on Paired k and Independent k-degree anonymity on DBLP-cite dataset. a Degree distribution, bEI, cEA, d\(\overline{dist}\), ed, f in-degree \(C_C\), g out-degree \(C_C\), h in-degree \(C_D\), i out-degree \(C_D\), j\(C_B\), kInfomap, lWalktrap

The first two rows in Table 2 correspond to edge intersection and edge addition. Edge intersection is the percentage of edges on the anonymous networks which are also present in the original network. Figure 5b shows that this metric is linear on the k value on both methods. It is important to underline that more than 90% of the arcs in \(\widetilde{A}\) are also present in A. The next metric is closely related to this one, and the behavior is similar, as depicted in Fig. 5c. Edge addition indicates the number of arcs added to anonymize the network. The difference between these metrics rely on the edge switch and edge extension operations, which can modify some arcs to fulfill the anonymous degree sequences. We note that usually the number of arcs added by Independent k-degree is half the number of Paired k-degree. Average distance is pointed out in the third row. The value of the original network (denoted by \(k=1\)) is 5.427. Thus, values close to this one indicate low information loss. Although both methods achieve good results, values of Independent k-degree anonymity remain closer to the original one over all anonymization range than values of Paired k-degree, as can be seen in Fig. 5d. Indeed, the average error computed over all range is 0.996 for Paired k-degree and 0.439 for Independent k-degree. A similar behavior can be seen within the second metric, diameter (see Fig. 5e).

The centrality measures are computed by Eq. 9, as described previously. Therefore, the values in Table 2 are 0 for the original network, i.e., k=1. Clearly, the lower the value, the better the performance of the corresponding method. Betweenness centrality is an important measure for some clustering and community detection algorithms. We remark that the error remains almost stable for values of \(k \ge 3\), as can be seen in Fig. 5j. Figure 5f, g depict the in- and out-degree closeness centrality, respectively. We can observe that these values remain low, except from the out-degree values of Paired k-degree anonymity which present slightly high values. Finally, the in- and out-degree centrality measures, which are depicted in Fig. 5h, i, indicate very similar values and behavior, independently of the method used to anonymize the network. We should note here that the results shown in Fig. 5 and Table 2 correspond to a single run of the proposed models. As we can observe, some data utility criteria are not monotonous with respect to the anonymization level k. For example, in Fig. 5e, the diameter of the graph presents a spike at \(k=8\), which turns out to be an effect of the edge modification process (edge switch and edge extension).

Table 3 presents the same analysis on the other tested networks. However, due to space constraints, only the average error (last column of Table 2) is pointed out for each metric, method and network. Polblogs is a medium size network with some important hubs. Hence, anonymization process is harder than other networks, adding an average of 19% of total arcs to fulfill Paired k-degree anonymity. When Independent K-degree is considered, less than 5% of arcs have to be added. Nonetheless, the average distance and diameter show relatively small distortion after the anonymization process. On the contrary, Epinions is our largest network and the results are very encouraging. Less than 6% of the total number of arcs need to be added to fulfill Paired k-degree anonymity. The same value reduces to less than 2% in case of Independent k-degree anonymity. Moreover, the average distance and diameter show very small perturbation. Indeed, there is no perturbation in diameter using Independent k-degree anonymity.

Table 3 Results for Paired k (P-k) and Independent k (I-k) degree anonymity over 10 levels of anonymization, and Independent \((k_i,k_o)\) (I-(\(k_i\), \(k_o\))) degree anonymity over 100 levels of anonymization

In the previous paragraph, we have considered and compared Independent k-degree when \(k=k_i=k_o\). This is an specific case, but also the most probable, interesting and useful one. Nevertheless, we want to analyze the general case, i.e., Independent \((k_i,k_o)\)-degree anonymity where \(k_i \ne k_o\). In the following experiments, we will consider all possible combinations of \(k_i,k_o \in [1,\ldots ,10]\) on Polblogs, which implies 100 anonymous datasets. Note that, 10 of these datasets are the same in Independent k-degree anonymity.

Results of Independent \((k_i,k_o)\)-degree anonymity are depicted in Table 3 (third row) and in Fig. 6. As it can be seen, the average error of Independent \((k_i,k_o)\)-degree anonymity on generic information loss measures remains higher than Independent k but lower than Paired k. It is interesting to underline that the number of edges added when \(k_i \approx 10\) or \(k_o \approx 10\) is very similar to the one when \(k_i, k_o \approx 10\), but the privacy level is not. Indeed, the best ratio between number of arcs added and privacy level is achieved when \(k_i = k_o\), as can be seen in Fig. 6a. The average distance (Fig. 6b) also shows that the error is greater when considering very different values of \(k_i\) and \(k_o\).

Fig. 6
figure 6

Independent \((k_i,k_o)\)-degree anonymity on Polblogs dataset. a Degree distribution, aEA, b\(\overline{dist}\), cInfomap, dWalktrap

Finally, we have considered a baseline comparison to our methods. It is a naïve approach based on converting the digraphs to undirected graphs, applying a k-degree anonymity algorithm (with k in the same range as the other methods) and transforming again to directed graphs to make the comparison fair. For this analysis, we have used the UMGA algorithm [10], which demonstrated to preserve data utility better than other k-degree-based methods.

We named this method Undirected k-degree anonymity (U-k), and its results are depicted in Table 3 (fourth row). It is important to underline that the privacy level achieved is the same as in the Paired k-degree anonymity, since the in- and out-degree of the anonymous graphs are the same. However, the error values are higher than the ones for Paired k on all metrics, except the diameter.

6.2 Clustering information loss evaluation

As we have stated previously, clustering-specific information loss measures are important to consider data utility and information loss on real graph-mining processes. Even though the generic information loss measures are a good way to assess the data utility, specific information loss measures can help us to quantify data utility and information loss associated to a data publishing process. The last two rows in Table 2 and the last two columns in Table 3 present the precision index computed using the Infomap and Walktrap algorithms. As we have previously commented, the precision index takes the value of zero when there is no overlap between the sets and the value of one when the overlap between the sets is complete.

Analyzing our in-depth experiment on the DBLP-cite network, we can point out that more than 94 (Paired k) and 97% (Independent k) are correctly clustered using Infomap after \(k=10\) anonymization process, as shown in Fig. 5k. Precision index average values are 0.97 and 0.985, respectively. Similar results can be seen for our other tested networks in Table 3. Precision index on Independent \((k_i,k_o)\)-degree anonymity on Infomap is shown in Fig. 6c. As in the previous experiments, Infomap seems to be more stable and less sensitive to data perturbation. The average precision index keeps close to 90% well-classified vertices using both clustering algorithms. These results allow us to claim that data utility is preserved using our methods to anonymize directed networks.

Lastly, it is interesting to point out that the precision index achieved using an undirected k-degree approach (U-k) is far worse than the methods specifically developed to deal with directed networks when considering the clustering-specific information loss.

Summarizing the results, it is interesting to stress out that both methods achieve good results on generic and clustering-specific information loss measures. Specifically, Independent k-degree anonymity gets the lowest average error on all analyzed metrics and datasets and the highest precision index values on all clustering algorithms. It is also important to underline that although Paired k-degree anonymity imposes the strongest privacy levels, it achieves very good results on all analyzed metrics.

7 Performance and scalability

In this section, we aim to improve the scalability of the proposed methods toward being able to anonymize large-scale directed networks. To this direction, we have applied two preprocessing techniques for obtaining the k-anonymous degree sequences. The first one, which is tailored for the Independent (\(k_i\), \(k_o\))-degree anonymity model, is based on a lossless representation of the degree sequences \(d_\mathrm{in}\), \(d_\mathrm{out}\) with a considerable reduction in size [34].

For the case of Paired k-degree anonymity, since the micro-aggregation technique is not scalable (e.g., the MDAV algorithm has time complexity proportional to \(O(n^2)\)), we have applied k-means as a partitioning method (preprocessing step of O(n) complexity). In particular, we have used Lloyd’s algorithm in a hierarchical way to obtain partitions of manageable size.

More precisely, we start by applying k-means to obtain a partition for the entire dataset. If the parts are not small enough (smaller than a threshold s), we further apply k-means on each of them, until we satisfy the required threshold. Notice that our solution could have been executed in parallel, yielding an even faster algorithm in practice. The method is presented in Algorithm 2.

figure f

In order to examine the scalability of our methods, we have used two large-scale real networks. The first one is the DBLP-2006,Footnote 2 which corresponds to the co-authorship network of the DBLP computer science bibliography in 2006. The second one, Pokec [37], is the most popular online social network in Slovakia. Table 4 provides the main properties of these networks. All the experiments reported in this section have been performed on a 4 CPU Intel Xeon X3430 at 2.40GHz with 32GB RAM, running Debian GNU/Linux.

Table 4 Large datasets used to test the scalability of our methods
Table 5 Results for Paired k (P-k) and Independent k (I-k) degree anonymity for \(k \in \{10,20,50,100\}\)

Table 5 depicts the results of the scalability experiments. For each network and method, we have considered values of \(k \in \{10,20,50,100\}\). As a summary of the first step of our method, we provide the computation time (in secs.) and the number of new arcs that is needed to create a k-degree anonymous sequence. Regarding the second step of our method, we report the computation time, as well as the number of edge addition, switch and extension that is performed.

As it can be seen in Table 5, the computation time of degree sequence anonymization (step 1) is negligible compared to the one of graph modification process (step 2). Consequently, the Paired k model is more time-consuming compared to the Independent k model, mainly due to the fact that the Paired k involves the insertion of more new edges in order to reach the desired privacy level. Figure 7a also shows that the total running time grows linearly with respect to the value of k. Additionally, the number of arcs that need to be added also grows linearly with the value of k, as shown in Fig. 7b. Finally, as depicted in Fig. 7c, the number of edge additions, swaps and extensions grows proportionally while the value of k increases.

However, the computation time needed by the Paired and Independent k anonymity models is quite similar in the DBLP-2006 graph. Although the number of arcs that need to be created is much lower in the Independent k model, the number of edge switch operations is higher; edge switches, along with edge extensions, are more time-consuming compared to edge additions which are computationally easy to be performed.

Finally, regarding data utility and information loss, we underline that the preprocessing technique on Independent (\(k_i\), \(k_o\))-degree anonymity model preserves the quality of the solution, as demonstrated by authors in [34]. On the contrary, the preprocessing technique on Paired k-degree anonymity can slightly reduce the quality of the solution compared to the case where no preprocessing step is applied. We measure this divergence as the number of added extra arcs to the k-anonymous degree sequences, which is between 0.1 and \(1.5\%\) according to our experiments on DBLP-2006. Specifically, the values are 0.13%, 0.33%, 0.93% and 1.57% for \(k \in \{10, 20, 50, 100\}\).

8 Conclusions

In this paper, we have defined two different k-degree anonymity models specifically designed for directed networks. Furthermore, we have introduced different algorithms to achieve the desired privacy levels, based on the proposed models. An empirical evaluation of these models have been conducted on several real networks, comparing information loss based on different graph properties and also on clustering-specific criteria. We have demonstrated that our anonymization models aim to reduce information loss, while simultaneously retain data utility. As we have seen throughout our experimental framework, the Independent k-degree anonymity model demands fewer edge additions and switches in order to meet the desired privacy level. Nevertheless, the Paired k-degree model gives good results in several generic information loss measures and also achieves excellent precision index values in our clustering-specific information loss framework. Furthermore, we have demonstrated that our edge modification technique is scalable to large-scale networks.

Fig. 7
figure 7

Details of our experiments on Pokec network. The horizontal axis represents the k-degree anonymity value. a Computation time, b new arcs, c detail of new arcs

Many interesting directions for future research have been uncovered by this work. Firstly, a deeper analysis of the Independent (\(k_i\), \(k_o\))-degree anonymity model have to be performed in order to better understand how these parameters can be used according to network’s specific properties in order to achieve good privacy levels, while preserving the underlying graph structure. Secondly, it would be thought-provoking to also consider edge deletion in order to better preserve data utility. Moreover, other information loss measures based on real graph-mining processes can be considered, such as information flow. Lastly, it would be also very interesting to extend those models to other rich graph types, including weighted, signed and multilayer [31] directed networks.