
1 Introduction

Collecting data and sharing them for secondary analysis is increasingly widespread and brings undoubted social and economic benefits. Yet, when data constitute personally identifiable information (PII), sharing them may threaten people’s privacy. As a consequence, administrations have strengthened privacy regulations to protect citizens. In a nutshell, these new privacy regulations, epitomized by the EU General Data Protection Regulation, require consent from data subjects for any collection, sharing or analysis of PII. In the many situations in which obtaining consent is not feasible, anonymization is the only remaining option. After anonymization, data no longer qualify as PII and, thus, are no longer subject to data protection regulations.

Anonymizing data involves not only suppressing identifiers, but also altering other attributes. The original data are first stripped of identifiers, and then a statistical disclosure control method is used to mask the remaining attributes so that they no longer reveal information about the original data subjects. Masking is not straightforward because, to keep the masked data statistically valid, the information loss must be minimized. Among the available statistical disclosure control techniques, in this paper we focus on microaggregation. Microaggregation replaces records in the original data set by (aggregated) records that refer to groups of data subjects. The larger the groups, the stronger the protection. To guarantee at least a certain level of protection, microaggregation algorithms take a parameter k that determines the minimum required group size.

In recent years, the research on data anonymization performed by the computer science community has focused on privacy models. A privacy model describes the condition that data must satisfy for the disclosure risk to be at an acceptable level, but it does not describe how this condition should be attained. k-Anonymity [15] is among the most popular privacy models. It seeks to limit the probability of successful record re-identification by altering the values of quasi-identifier attributes. Quasi-identifiers are attributes that are not re-identifying when considered separately (e.g. in general Age, Profession and Zipcode do not identify anyone on their own), but whose combination may identify the subject to whom a record corresponds (there may be a single 95-year-old doctor in a certain zipcode, and it may be easy to find her name in an electoral roll). Interestingly, running microaggregation on the quasi-identifiers yields k-anonymity [8]. Microaggregation is also useful to enforce l-diversity and t-closeness, two extensions of k-anonymity [7, 19], and it is a building block of \(\varepsilon \)-differentially private algorithms [17, 18].

To minimize the information loss incurred by microaggregation, we need to carefully choose the groups of records to be aggregated. A common approach in numerical microaggregation is to attempt to minimize the sum of squared distances between the original records and their corresponding aggregated records, which will be called SSE. Unfortunately, finding a microaggregation that minimizes SSE is an NP-hard problem. For this reason, existing approaches are heuristic. Most current microaggregation algorithms generate clusters of fixed size (the minimum required cluster size). This cardinality constraint reduces the complexity of the microaggregation algorithm, but it may result in a large information loss. To reduce the information loss, heuristic variable-size microaggregation algorithms have been proposed, but their computational complexity is greater than that of their fixed-size counterparts. Also, in some cases they need additional parameters whose optimal values are hard to determine.

Contribution and Plan of this Work

Microaggregation is closely related to clustering: in fact, it is clustering with a minimum cardinality constraint on clusters. In this work, we take advantage of the information loss minimization capabilities of Lloyd’s clustering algorithm [12] to achieve near-optimal variable-size microaggregation. First, we embed a minimum cluster size constraint in the algorithm. Second, given that Lloyd’s algorithm requires the number of clusters to be fixed beforehand, we modify it to allow a variable number of clusters. We call the resulting heuristic ONA (Near-Optimal microaggregation Algorithm). We then present empirical results on the information loss and the computing time of variable-size microaggregation with ONA.

In Sect. 2, we give some background on microaggregation and Lloyd’s algorithm. In Sect. 3, we describe some limitations of current microaggregation algorithms. In Sect. 4, we present the ONA algorithm to deal with these limitations. In Sect. 5, we experimentally compare ONA with existing methods. Conclusions and directions for future work are given in Sect. 6.

2 Background

2.1 Microaggregation

Microaggregation is a perturbative method for statistical disclosure control of microdata releases. It is based on the following two steps:

  • Partition: The records in the original data set are partitioned into several clusters, each of them containing at least k records (the minimum cluster size). To minimize information loss in the following step, records in each cluster should be as close to one another as possible.

  • Aggregation: An aggregation operator is used to compute a centroid record for each cluster. If all attributes are numerical, the centroid record is the mean record. Finally, every record in the cluster is replaced by its cluster centroid record.

When replacing records by cluster centroids in the aggregation step of microaggregation, some information is lost. The ensuing loss of variability is a measure of information loss. A microaggregation algorithm is optimal if it minimizes information loss.

Let SST be the total sum of squares, that is, the sum of squared distances between each record r in an original data set D and the centroid record c(D) of the entire data set:

$$ SST=\sum _{r\in D}\left\| r-c(D)\right\| ^{2}. $$

Clearly, SST represents the total variability of D. Now let SSE be the sum of squared record errors, that is, the sum of squared distances between each record r and the centroid c(r) of the cluster to which r belongs:

$$ SSE=\sum _{r\in D}\left\| r-c(r)\right\| ^{2}. $$

SSE represents the loss of variability incurred when replacing records with centroids. We can normalize SSE by dividing it by SST, so that SSE / SST accounts for the proportion of the total variability lost due to microaggregation. With numerical attributes, the mean is a sensible choice as the aggregation operator because, for any given cluster partition, it minimizes SSE in the aggregation step; the challenge is thus to find a partition that minimizes the overall SSE.
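As a concrete illustration of these quantities, the following Java sketch computes SST, SSE and the normalized measure \(100\times SSE/SST\) for a partition of numerical records into clusters. It is a minimal sketch of the formulas above; the class and helper names are ours, chosen for illustration.

```java
import java.util.List;

/** Minimal sketch (our own helper names): SST, SSE and 100*SSE/SST
    for a partition of d-dimensional numerical records into clusters. */
public class InformationLoss {

    /** Centroid (mean record) of a non-empty list of records. */
    static double[] centroid(List<double[]> records) {
        int d = records.get(0).length;
        double[] c = new double[d];
        for (double[] r : records)
            for (int j = 0; j < d; j++) c[j] += r[j];
        for (int j = 0; j < d; j++) c[j] /= records.size();
        return c;
    }

    /** Squared Euclidean distance between two records. */
    static double sqDist(double[] a, double[] b) {
        double s = 0;
        for (int j = 0; j < a.length; j++) { double diff = a[j] - b[j]; s += diff * diff; }
        return s;
    }

    /** SST: squared distances of all records to the centroid of the whole data set. */
    static double sst(List<double[]> dataset) {
        double[] c = centroid(dataset);
        double s = 0;
        for (double[] r : dataset) s += sqDist(r, c);
        return s;
    }

    /** SSE: squared distances of the records to their respective cluster centroids. */
    static double sse(List<List<double[]>> clusters) {
        double s = 0;
        for (List<double[]> cluster : clusters) {
            double[] c = centroid(cluster);
            for (double[] r : cluster) s += sqDist(r, c);
        }
        return s;
    }

    /** Normalized information loss, as reported in the experiments: 100 * SSE / SST. */
    static double informationLoss(List<double[]> dataset, List<List<double[]>> clusters) {
        return 100.0 * sse(clusters) / sst(dataset);
    }
}
```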

Optimal microaggregation is feasible in the univariate case, i.e. for a single numerical attribute. There are two well-known necessary optimality conditions in this case [4]: clusters must contain consecutive records, and the size of the clusters must be between k and \(2k-1\). Given these two conditions, a shortest-path algorithm can find the optimal univariate microaggregation with cost \(O(n \log n)\) for n records [9].
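The shortest-path formulation of [9] can be sketched as a dynamic program over the sorted values: dp[j] is the minimum SSE of a valid microaggregation of the first j sorted values, and the last cluster must contain between k and \(2k-1\) consecutive values. The Java sketch below is our own simplified formulation, not the authors’ implementation; for clarity it runs in O(nk) after sorting rather than the \(O(n \log n)\) of [9], and it uses prefix sums to evaluate each within-cluster SSE in constant time.

```java
import java.util.Arrays;

/** Sketch of optimal univariate microaggregation via dynamic programming,
    equivalent in spirit to the shortest-path formulation cited in the text. */
public class UnivariateOptimal {

    /** Minimum SSE over all partitions of the values into clusters of
        consecutive (sorted) values with sizes between k and 2k-1. */
    static double minimumSSE(double[] values, int k) {
        double[] x = values.clone();
        Arrays.sort(x);                      // optimal clusters contain consecutive values
        int n = x.length;
        double[] sum = new double[n + 1];    // prefix sums
        double[] sumSq = new double[n + 1];  // prefix sums of squares
        for (int i = 0; i < n; i++) {
            sum[i + 1] = sum[i] + x[i];
            sumSq[i + 1] = sumSq[i] + x[i] * x[i];
        }
        double[] dp = new double[n + 1];     // dp[j]: best SSE for the first j values
        Arrays.fill(dp, Double.POSITIVE_INFINITY);
        dp[0] = 0;
        for (int j = k; j <= n; j++) {
            // the last cluster covers values i+1..j, with k <= j-i <= 2k-1
            for (int i = Math.max(0, j - (2 * k - 1)); i <= j - k; i++) {
                if (dp[i] == Double.POSITIVE_INFINITY) continue;
                double s = sum[j] - sum[i];
                double q = sumSq[j] - sumSq[i];
                double clusterSSE = q - s * s / (j - i);  // SSE of the cluster around its mean
                dp[j] = Math.min(dp[j], dp[i] + clusterSSE);
            }
        }
        return dp[n];
    }
}
```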

Since realistic data sets contain multiple attributes, univariate microaggregation is not enough. Multivariate microaggregation is more complex: the first optimality condition above does not apply for want of a total order in the data domain. As a result, the search space for the optimal multivariate microaggregation remains too large and finding the optimal solution is NP-hard [14]. Therefore, heuristics are employed to obtain an approximation with reasonable cost. An example heuristic for the partition step of microaggregation is MDAV [8], which generates fixed-size clusters. Alternatively, VMDAV [16] is an adaptation of the MDAV heuristic that allows variable-size clusters.

2.2 MDAV

The MDAV algorithm aims at satisfying the optimality conditions of numerical univariate microaggregation:

  1. Optimal clusters must contain consecutive elements. Since a total order is lacking in a multivariate domain, the notion of consecutive elements is not well defined. However, the intuition remains valid: it makes no sense to include a record \(r'\) in a cluster if a record r that is closer to the records of the cluster is left out of it.

  2. The size of optimal clusters ranges between k and \(2k-1\). This condition remains valid in the multivariate case.

Thus, rather than minimizing the overall information loss, the MDAV heuristic proceeds by selecting specific records at the boundary of the set of records not yet assigned to any cluster and generating clusters of k elements around them: given a record r, a cluster is formed with r and the \(k-1\) records closest to r among those not clustered yet. See Algorithm 1.

[Algorithm 1: the MDAV heuristic]
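A simplified Java sketch of this fixed-size strategy is given below. It follows the textual description above rather than the exact pseudocode of Algorithm 1 (whose boundary handling may differ), and it reuses the centroid and sqDist helpers from the sketch in Sect. 2.1.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified sketch of MDAV-style fixed-size clustering, following the
    textual description above (not a verbatim transcription of Algorithm 1). */
public class MdavSketch {

    /** Partitions the records into clusters of k records each; the last cluster
        keeps between k and 2k-1 records (assuming the data set has at least k records). */
    static List<List<double[]>> partition(List<double[]> records, int k) {
        List<double[]> remaining = new ArrayList<>(records);
        List<List<double[]>> clusters = new ArrayList<>();
        while (remaining.size() >= 2 * k) {
            double[] centroid = InformationLoss.centroid(remaining);
            double[] r = farthestFrom(centroid, remaining);      // a boundary record
            clusters.add(extractClusterAround(r, remaining, k));  // r plus its k-1 nearest neighbours
        }
        clusters.add(new ArrayList<>(remaining));                 // between k and 2k-1 records left
        return clusters;
    }

    /** Record of the list that is farthest from the reference record. */
    static double[] farthestFrom(double[] ref, List<double[]> records) {
        double[] best = records.get(0);
        double bestDist = -1;
        for (double[] r : records) {
            double d = InformationLoss.sqDist(r, ref);
            if (d > bestDist) { bestDist = d; best = r; }
        }
        return best;
    }

    /** Removes r and its k-1 nearest neighbours from 'remaining' and returns them as a cluster. */
    static List<double[]> extractClusterAround(double[] r, List<double[]> remaining, int k) {
        List<double[]> cluster = new ArrayList<>();
        remaining.remove(r);
        cluster.add(r);
        for (int i = 1; i < k; i++) {
            double[] nearest = null;
            double bestDist = Double.POSITIVE_INFINITY;
            for (double[] s : remaining) {
                double d = InformationLoss.sqDist(s, r);
                if (d < bestDist) { bestDist = d; nearest = s; }
            }
            remaining.remove(nearest);
            cluster.add(nearest);
        }
        return cluster;
    }
}
```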

2.3 VMDAV

VMDAV is an adaptation of MDAV that can yield variable-size clusters. The underlying idea is that variable-size clusters can adapt better to the distribution of the records and, thus, reduce the information loss.

Essentially, VMDAV takes two steps: (i) generate a cluster of size k that contains the record that is farthest from the average record and its closest \(k-1\) records, and (ii) expand the cluster with neighboring records. These steps are repeated until all the records have been assigned to a cluster.

The first step is similar to MDAV, so we only describe the second step. Once we have a cluster with k records, we look for \(r_{u}\), the unclustered record that minimizes the distance to the records in the cluster. Let \(d_{in}\) be that minimum distance. Then we compute \(d_{out}\), the minimum distance between \(r_{u}\) and the remaining unclustered records. The cluster expansion procedure is based on these two distances. If \(d_{in}\) is smaller than \(d_{out}\), then \(r_{u}\) is closer to the records in the cluster than to the other unclustered records; in that case, adding \(r_{u}\) to the current cluster is a sensible choice. To allow tuning cluster expansion, VMDAV introduces a threshold parameter \(\gamma \), so that the current cluster is expanded with \(r_u\) if \(d_{in}<\gamma d_{out}\).
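A minimal Java sketch of this expansion step is shown below. The helper names are ours, the distance helpers come from the sketch in Sect. 2.1, and we assume (as recalled in Sect. 3) that clusters are never expanded beyond \(2k-1\) records.

```java
import java.util.List;

/** Sketch of VMDAV's cluster-expansion step (helper names are ours). */
public class VmdavExpansion {

    /** Repeatedly expands the cluster with the closest unclustered record r_u
        while the criterion d_in < gamma * d_out holds and the cluster has
        fewer than 2k-1 records. */
    static void expandCluster(List<double[]> cluster, List<double[]> unclustered,
                              double gamma, int k) {
        while (cluster.size() < 2 * k - 1 && !unclustered.isEmpty()) {
            // r_u: the unclustered record closest to the cluster; d_in: that distance
            double[] rU = null;
            double dIn = Double.POSITIVE_INFINITY;
            for (double[] s : unclustered)
                for (double[] t : cluster) {
                    double d = Math.sqrt(InformationLoss.sqDist(s, t));
                    if (d < dIn) { dIn = d; rU = s; }
                }
            // d_out: minimum distance between r_u and the remaining unclustered records
            double dOut = Double.POSITIVE_INFINITY;
            for (double[] s : unclustered) {
                if (s == rU) continue;
                dOut = Math.min(dOut, Math.sqrt(InformationLoss.sqDist(rU, s)));
            }
            if (dIn < gamma * dOut) {   // expansion criterion
                unclustered.remove(rU);
                cluster.add(rU);
            } else {
                break;                  // stop expanding; a new cluster will be started
            }
        }
    }
}
```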

2.4 Clustering and Lloyd’s Algorithm

There are several approaches to generating clusters. In this work, we are interested in centroid-based clustering (a.k.a. c-means clustering). The purpose of c-means is to split the records into a fixed number c of clusters in such a way that SSE is minimized.

Lloyd’s algorithm is designed for c-means clustering. Starting from an arbitrary set of c centroids, the algorithm proceeds by iteratively assigning each record to the closest centroid and recomputing the centroids, until a convergence criterion is met. See Algorithm 2.

The runtime of Algorithm 2 is O(ncdi), where n is the number of records, c is the number of clusters, d is the number of attributes per record and i is the number of iterations needed until convergence. Lloyd’s algorithm is thus often considered to have linear complexity in practice, although in the worst case it can be superpolynomial.

[Algorithm 2: Lloyd’s algorithm]
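The following Java sketch illustrates Lloyd’s algorithm as just described. It is our own illustrative version rather than a transcription of Algorithm 2, it reuses the helpers from the sketch in Sect. 2.1, and for brevity it stops after a fixed number of iterations instead of testing a convergence criterion.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Sketch of Lloyd's algorithm for c-means clustering. */
public class LloydSketch {

    static List<List<double[]>> cluster(List<double[]> records, int c, int maxIterations) {
        Random rnd = new Random(0);
        // Start from c arbitrary centroids (here: c records picked at random)
        List<double[]> centroids = new ArrayList<>();
        for (int i = 0; i < c; i++)
            centroids.add(records.get(rnd.nextInt(records.size())).clone());

        List<List<double[]>> clusters = new ArrayList<>();
        for (int it = 0; it < maxIterations; it++) {
            // Assignment step: each record goes to the cluster of its closest centroid
            clusters = new ArrayList<>();
            for (int i = 0; i < c; i++) clusters.add(new ArrayList<>());
            for (double[] r : records) {
                int best = 0;
                double bestDist = Double.POSITIVE_INFINITY;
                for (int i = 0; i < c; i++) {
                    double d = InformationLoss.sqDist(r, centroids.get(i));
                    if (d < bestDist) { bestDist = d; best = i; }
                }
                clusters.get(best).add(r);
            }
            // Update step: recompute each centroid as the mean of its cluster
            for (int i = 0; i < c; i++)
                if (!clusters.get(i).isEmpty())
                    centroids.set(i, InformationLoss.centroid(clusters.get(i)));
        }
        return clusters;
    }
}
```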

3 Limitations of MDAV and VMDAV

MDAV is quite effective at generating clusters that are as compact as possible: it looks for the record that is farthest from the average record and then generates a cluster that contains it and the \(k-1\) records closest to it. In this way MDAV creates compact clusters and avoids intersecting clusters, which are undesirable because their records could be rearranged into non-intersecting clusters, thereby reducing information loss. The greatest limitation of MDAV is that all clusters (except perhaps the last one) have fixed size k. This is much more restrictive than the optimality condition according to which cluster cardinality must be between k and \(2k-1\), and it may have a significant negative impact on information loss. This limitation affects not only MDAV but all microaggregation methods that use fixed-size clusters.

VMDAV improves over MDAV by being more flexible about cluster sizes. However, its cluster expansion criterion is difficult to adjust. VMDAV uses an extra threshold parameter \(\gamma \) to decide between expanding the current cluster with an additional element (up to a maximum of \(2k-1\) elements) or creating a new cluster. The difficulty comes from the fact that it is not known how to set \(\gamma \) appropriately.

In [16], we find some vague recommendations, which suggest using large thresholds (e.g. \(\gamma =1.1\)) when records are concentrated around specific areas of the data domain, whereas smaller thresholds (e.g. \(\gamma =0.2\)) are preferable when records are scattered. The rationale for recommending a small \(\gamma \) for scattered records is clear: in this case, small clusters are preferable to avoid a large SSE. However, we should keep in mind that a small \(\gamma \) hampers the cluster expansion mechanism, and VMDAV becomes closer to MDAV. The rationale for using a large \(\gamma \) when records are concentrated around specific points is unclear to us: after all, regardless of the distribution of the records, we should prefer smaller clusters to larger ones. This is illustrated in Fig. 1, which displays two microaggregation partitions with minimum size \(k=3\) that could be obtained using VMDAV. On the left, all clusters have size 3, which is a result compatible with VMDAV for small \(\gamma \) (and also with MDAV). On the right, the size of the clusters is greater than 3, which is compatible with VMDAV for large \(\gamma \). By looking at the distribution of the records, we observe that they are concentrated around two points; thus, according to the rules suggested in [16], we would select a large threshold, which would make the right-hand side partition likelier. However, SSE, and hence the information loss, is larger for this partition than for the left-hand side partition.

The issues of VMDAV that we have hinted at are confirmed in the experimental section, where VMDAV and MDAV achieve comparable levels of information loss. That is, the cluster expansion procedure of VMDAV is not capable of offering noticeable reductions in the information loss.

Fig. 1. Two microaggregation partitions with minimum size \(k=3\). Left: a partition where all clusters have size 3. Right: a partition where clusters have size greater than 3.

One justification for suggesting a large \(\gamma \) when records are concentrated in different regions is to avoid obtaining clusters that span more than one region. The left-hand side of Fig. 2 shows an example of this undesirable situation. This partition, where all clusters except one have size 3, could be the result of taking \(k=3\) in MDAV or in VMDAV with small \(\gamma \). Taking a large threshold in VMDAV is expected to facilitate variable-size clusters, which might solve the problem. However, as shown on the right-hand side of Fig. 2, variable-size clusters are not guaranteed to achieve the required result: there is still a cluster spread across two regions.

Fig. 2. Clusters that span several regions. Left: partition output by MDAV with \(k=3\) or by VMDAV with \(k=3\) and small \(\gamma \). Right: partition output by VMDAV with large \(\gamma \), where cluster sizes can vary between \(k=3\) and \(2k-1=5\).

Even if the previous VMDAV threshold rules were effective for data sets that are clearly concentrated or clearly scattered, we would still be at a loss for data sets that do not qualify as either of those two types. For example, consider a data set that has several small regions with concentrated records and a big region with scattered records.

Furthermore, in general it cannot be assumed that the data controller choosing anonymization parameters knows whether her data set is scattered, concentrated, etc. In fact, for large and high-dimensional data sets, it may be quite difficult to grasp how records are distributed in the domain of attributes.

In summary, fixed-size microaggregation incurs a large information loss and cluster expansion strategies such as those used in VMDAV are difficult to adjust.

4 ONA: Near-Optimal MicroAggregation

In this section we propose ONA (Near-Optimal microAggregation), a novel variable-size microaggregation method that is based on standard clustering algorithms. On the one hand, clustering algorithms adjust the size of each cluster automatically. We take advantage of this property in ONA, while making sure that the size of the clusters stays within the known optimal bounds, that is, between k and \(2k-1\). On the other hand, clustering algorithms usually take the number of clusters as a parameter. In microaggregation, we do not care about the number of clusters; we simply want a valid clustering that minimizes the information loss. Thus, having to tell the microaggregation algorithm the number of clusters we want would be an artificial restriction that we prefer to avoid, both for the sake of algorithm clarity and to avoid unnecessary information loss.

ONA follows Lloyd’s online algorithm (see Algorithm 2) but it makes several adjustments to guarantee that an appropriate number of clusters with an appropriate size is generated. Algorithm 3 formalizes ONA and its steps are explained next:

  • We start (at line 3) by generating a random set of clusters whose cardinality is k or more. The minimum cardinality constraint of microaggregation is enforced by starting with a set of clusters that conforms to it and by making sure that any modification of the clusters does not violate it.

  • The proposed algorithm is iterative. Each iteration (lines 4–29) is designed to reduce the SSE of the clustering, until convergence is reached. The convergence condition is not specified in the algorithm. To be strict, we should require a completely stable set of clusters. However, as most of the reduction in SSE is attained in the first few iterations, it is usually safe to use less strict conditions to speed up the execution. We will describe alternative convergence conditions when reporting experiments in Sect. 5.

  • Following Lloyd’s online algorithm, we loop through the records in the data set (lines 5–28) and reassign them (if needed) to the closest cluster, so that SSE decreases.

  • It is only possible to reassign a record if its current cluster contains more than k records (lines 7–11). Otherwise, fewer than k records would remain in the cluster and the clustering would not satisfy the minimum cardinality constraint. If the cluster of the current record has more than k records, we remove the record from the cluster (line 9) and assign it to the closest cluster (line 11).

  • When the cluster of the current record has k records, the only way to reassign the current record to another cluster is to dissolve the cluster and reassign all its records to other clusters (lines 12–20). This is only done if it reduces SSE. In line 15 all reassignments are computed: \(C_{j(s)}\) is the cluster to which record s is reassigned. The contribution to SSE of the original clusters (\(SSE_{1}\), line 16) and the SSE of the reassigned clusters (\(SSE_{2}\), line 17) are computed. If \(SSE_{2}<SSE_{1}\), the reassignments are applied; otherwise, the current clustering is kept unmodified.

  • Finally, the algorithm checks that all clusters have at most \(2k-1\) records (as one of the optimality conditions requires). This condition must be checked because the reassignments can make clusters grow beyond \(2k-1\) records. If a cluster with 2k or more records is found, we apply Algorithm 3 recursively to that cluster, which will split it into two clusters of size between k and \(2k-1\), thereby reducing SSE.

In spite of the distinction between the current cluster having more than k records or exactly k records, the complexity of Algorithm 3 remains essentially the same as that of Lloyd’s algorithm (see Sect. 2.4). A condensed code sketch of the main reassignment loop is given after Algorithm 3.

[Algorithm 3: ONA]
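The following condensed Java sketch captures the main reassignment loop of ONA, based on our reading of the description above; it is not the authors’ implementation. For brevity it omits two parts of Algorithm 3: dissolving a cluster of exactly k records when reassigning all its members lowers SSE (lines 12–20), and splitting clusters that grow to 2k or more records. It reuses the helpers from the sketch in Sect. 2.1.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Condensed sketch of ONA's main reassignment loop (assumes the data set
    has at least k records; dissolution and splitting steps are omitted). */
public class OnaSketch {

    static List<List<double[]>> microaggregate(List<double[]> records, int k, int maxIterations) {
        // Random initial partition into clusters of at least k records
        List<double[]> shuffled = new ArrayList<>(records);
        Collections.shuffle(shuffled);
        List<List<double[]>> clusters = new ArrayList<>();
        for (int i = 0; i < shuffled.size(); i += k) {
            int end = Math.min(i + k, shuffled.size());
            if (shuffled.size() - i < 2 * k) end = shuffled.size(); // last cluster absorbs the tail
            clusters.add(new ArrayList<>(shuffled.subList(i, end)));
            if (end == shuffled.size()) break;
        }

        for (int it = 0; it < maxIterations; it++) {
            boolean changed = false;
            for (List<double[]> current : clusters) {
                // A record may only leave a cluster that keeps at least k members
                for (int idx = 0; idx < current.size() && current.size() > k; idx++) {
                    double[] r = current.get(idx);
                    List<double[]> target = closestCluster(r, clusters);
                    if (target != current) {
                        current.remove(idx--);  // remove from the current cluster...
                        target.add(r);          // ...and reassign to the closest cluster
                        changed = true;
                    }
                }
            }
            if (!changed) break;   // strict convergence: no reassignments in a full pass
        }
        return clusters;
    }

    /** Cluster whose centroid is closest to record r. */
    static List<double[]> closestCluster(double[] r, List<List<double[]>> clusters) {
        List<double[]> best = null;
        double bestDist = Double.POSITIVE_INFINITY;
        for (List<double[]> c : clusters) {
            double d = InformationLoss.sqDist(r, InformationLoss.centroid(c));
            if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
    }
}
```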

5 Experimental Evaluation

5.1 Evaluated Methods

The motivation for our algorithm has been the limitations of MDAV and VMDAV. However, for completeness, the experimental section will not be limited to comparing with those two methods. We will compare the information loss using SSE and \(100\times SSE/SST\) (as described in Sect. 2.1) for the following methods: MDAV [4], VMDAV [16], MD-MHM [3], MDAV-MHM [3], CBFS-MHM [3], NPN-MHM [3], \(\mu \)-Approx [6], M-d [10], TFRP-1 [2], TFRP-2 [2], DBA-1 [11], DBA-2 [11] and IMHM [13].

5.2 Data Sets

The evaluation was performed on data sets [1] that have been used in the literature to evaluate microaggregation algorithms:

  • Census. Data set with 1080 records and 13 numerical attributes.

  • Tarragona. Data set with 834 records and 13 numerical attributes.

  • EIA. Data set with 4092 records and 11 numerical attributes.

5.3 Evaluation Results

The evaluation results are shown in Table 1. We observe that, while there are only small differences in the information loss reported by other methods, our proposal achieves a significantly smaller information loss. This behavior is consistent across cluster sizes and data sets.

Table 1. Information loss \(100\times SSE/SST\) for several values of k and several data sets

The algorithm has been implemented in Java and the experiments have been run on an AMD Ryzen 1700X machine under Ubuntu 17.04 x64. Table 2 shows the runtimes of ONA for the various test data sets and cluster sizes. To compute these runtimes, we have used the strictest convergence criterion: we keep iterating until no more record reassignments take place. We should remark that the steepest SSE decrease takes place during the first few iterations. Thus, a less strict convergence condition could offer significantly shorter runtimes without a substantial difference in SSE. Indeed, we have observed that the SSE reaches a stationary value long before the number of reassignments reaches zero.

Table 2. ONA runtimes in seconds for the test data sets and the tested cluster sizes.

6 Conclusions and Future Research

We have proposed ONA, a novel microaggregation algorithm that significantly reduces the information loss with respect to existing algorithms. ONA operates iteratively and is based on Lloyd’s clustering algorithm. Each iteration of ONA decreases the information loss until it converges to a (possibly local) minimum.

In the design of ONA, we have tried to match the two necessary conditions for optimal microaggregation as closely as possible. First, we make sure that each cluster contains only records that are close to one another; this is achieved by reassigning records to the cluster with the closest centroid. Second, we make sure that the size of clusters ranges between k and \(2k-1\). In record reassignments, we take care that a source cluster is never left with fewer than k records (otherwise we disband it) and that a destination cluster never grows beyond \(2k-1\) records (otherwise we split it into two clusters).

In the experimental section, we have presented an exhaustive comparison of the information loss against existing microaggregation algorithms. The results show that ONA offers a very significant reduction of the information loss. It is also important to remark that this reduction is achieved without resorting to complex procedures. Indeed, the internal operation of ONA is simpler than that of most of the microaggregation algorithms included in the comparison.

As future work, we plan to conduct a detailed analysis of the convergence conditions for ONA and also to extend it to categorical data. Currently, the range of microaggregation algorithms available for dealing with this kind of data is rather limited. The work in [5] provides a good starting point.