1 Introduction

A microdata file of p individuals (people, companies, etc.) and d variables (or attributes) is, in practice, a matrix \(A\in \mathbb {R}^{p \times d}\) whose element \(a_{ij}\) provides the value of attribute j for individual i, and whose row \(a_i\) gives the d attributes of individual i. Formally, a microdata file is a mapping \(M: S \subseteq P \rightarrow V_1 \times \ldots \times V_d\), where P is a population, S is a sample of the population and \(V_i\) is the domain of attribute \(i \in \{1,\ldots ,d\}\).

Microdata files must be protected before being released; otherwise, confidential individual information would be jeopardized. Microaggregation [5, 6] is a statistical disclosure control technique, mainly for numeric variables, which is related to k-anonymity [20].

The goal of microaggregation is to modify the values of the variables such that the released microdata satisfies k-anonymity. To this end, it first partitions the individuals (or points in \(\mathbb {R}^d\)) into subsets of size at least k, called clusters, and then replaces each point with the centroid of its cluster, trying to minimize the loss of information, called spread. In practical cases, the value of k is relatively small (e.g., \(3\le k \le 10\), see [6]). A widely used measure for evaluating the spread is the sum of squared errors (SSE) [6]:

$$\begin{aligned} SSE = \sum \limits _{i=1}^q \sum \limits _{j=1}^{n_i} (a_{i_j} - \overline{a}_i)^T(a_{i_j} - \overline{a}_i), \end{aligned}$$
(1)

where q denotes the number of clusters, \(n_i\) the size of cluster \(\mathcal{C}_i=\{a_{i_j}, j=1,\ldots ,n_i\}\), and \(\overline{a}_i = \frac{1}{n_i}\sum _{j=1}^{n_i} a_{i_j}\) its centroid, for \(i=1,\ldots ,q\). An equivalent measure that is also widely used in the literature is the information loss (IL), which is defined as

$$\begin{aligned} IL= \frac{SSE}{SST}\cdot 100, \end{aligned}$$
(2)

where SST is the total sum of squared errors for all the points, that is:

$$\begin{aligned} SST= \sum _{i=1}^p (a_i - \bar{a})^{\top }(a_i - \bar{a})\quad \hbox { where } \bar{a}= \frac{\sum _{i=1}^p a_i}{p}. \end{aligned}$$
(3)

IL always takes values within the range [0, 100]; the smaller the IL, the better the clustering. From now on, we will refer to a partition into clusters of size at least k as a feasible clustering.
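
As an illustration of definitions (1)–(3), the following minimal Python/NumPy sketch computes the SSE and IL of a given feasible clustering; the function names and the toy data are ours, not part of the paper.

```python
import numpy as np

def sse(A, clusters):
    """Sum of squared errors, Eq. (1): squared distances of points to their cluster centroid."""
    total = 0.0
    for idx in clusters:
        pts = A[idx]                      # points of one cluster
        centroid = pts.mean(axis=0)
        total += ((pts - centroid) ** 2).sum()
    return total

def information_loss(A, clusters):
    """IL = 100 * SSE / SST, Eqs. (2)-(3)."""
    sst = ((A - A.mean(axis=0)) ** 2).sum()
    return 100.0 * sse(A, clusters) / sst

# Toy example: p = 6 points in R^2, k = 3, two clusters of size 3.
A = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 4.9]])
clusters = [[0, 1, 2], [3, 4, 5]]         # a feasible clustering for k = 3
print(information_loss(A, clusters))      # small IL: clusters match the natural grouping
```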

Finding the partition that minimizes IL (or SSE) and satisfies the cardinality requirement \(n_i\ge k\) for \(i=1,\dots ,q\) is a difficult combinatorial optimization problem when \(d>1\) (multivariate data), which is known to be NP-hard [15]. For univariate data (that is, \(d=1\))—which in practice are the exception—microaggregation can be solved in polynomial time using the algorithm of [11], which is based on the shortest path problem.

Microaggregation differs from other related clustering problems, such as k-medians or k-means [10], in that it imposes a lower bound k on the cardinality of each cluster but no fixed number of clusters. In contrast, the parameter k in k-medians and k-means fixes the number of clusters, while imposing no constraint on the cardinality of each cluster.

There exist various papers on heuristic algorithms for feasible solutions to multivariate microaggregation with reasonable IL. Heuristics like minimum distance to average (MDAV) [6, 8] and variable minimum distance to average (VMDAV) [19] sequentially build groups of fixed (MDAV) or variable (VMDAV) size by considering the distances of the points to their centroid. Other approaches first order the multivariate points and then apply the polynomial-time algorithm of [11] to the ordered set of points: [7] used several fast ordering algorithms based on paths in the graph associated with the set of points, whereas [14] used (slower) Hamiltonian paths (which involve the solution of a traveling salesman problem). The heuristic of [16] also sequentially builds a set of clusters, attempting to locally minimize IL. Other approaches, such as those of [2, 13], are based on refining the solutions previously provided by another heuristic.

To our knowledge, the only two papers in the literature that apply optimization techniques to microaggregation and formulate it as a combinatorial optimization problem are [1] and [4]. Both of them apply a column generation algorithm inspired by the work in [9]. Those optimization approaches solve the linear relaxation of the integer microaggregation problem, thus computing a (usually tight) lower bound for the problem. They also provide a (usually very good) upper bound solution with an IL that is smaller than the ones reported by other heuristics. Note that having a lower bound on the optimal solution is instrumental for knowing how good the (upper bound) solutions computed by heuristics are, and even for performing fair comparisons between them. For instance, the heuristic introduced in [17] reported IL values below the certified lower bound—which is impossible—and this clearly indicates that the values of the datasets used in that paper were different from those in the rest of the literature (likely due to some sort of normalization of attributes).

The downside of those optimization-based techniques is that, when the dataset is large, the column generation may involve a large number of iterations, thus making it computationally very expensive. The main difference between the approaches in [1] and [4] is that the pricing problem of the former involves a nonlinear integer problem, while the latter requires a simpler linear integer problem. In practice, this means that the pricing subproblem in [1] can be tackled only by complete enumeration and only for small values of k, while [4] theoretically offers more flexibility and can deal with larger values of k.

Since optimization-based methods can be inefficient for large datasets but can provide high quality solutions in reasonable time for microdata with a small number of points, this work suggests a new approach consisting of first partitioning the set of points, and then applying an optimization approach to each smaller subset. The initial partitioning of the dataset is done according to a feasible clustering previously computed with the MDAV/VMDAV heuristics. Additionally, a local search improvement heuristic is also applied twice: first, to the solution provided by the MDAV/VMDAV heuristics prior to the partitioning; and second, to the final solution provided by the optimization approach.

This short paper is organized as follows. Section 2 outlines the optimization-based decomposition heuristic. Sections 3 and 4 outline the two main building blocks of the heuristic: the local search improvement algorithm and the mixed integer linear optimization method based on column generation in [4]. Section 5 shows the preliminary results from this approach with the standard Tarragona and Census datasets used in the literature.

2 The Decomposition Heuristic

The decomposition heuristic comprises the following steps:

  • Input: Microdata matrix \(A^0\in \mathbb {R}^{p \times d}\), minimum cluster cardinality k, upper bound of the number of subsets s in which the microdata will be decomposed.

    1. Standardize the attributes/columns of \(A^0\), obtaining matrix \(A\in \mathbb {R}^{p \times d}\). Compute the squared Euclidean distance matrix \(D\in \mathbb {R}^{p\times p}\), where \(D_{ij}= (a_i-a_j)^{\top }(a_i-a_j)\), which is used in the remaining steps.

    2. Apply the MDAV and VMDAV microaggregation heuristics using D. Let \(\mathcal{C}=\{\mathcal{C}_1, \dots , \mathcal{C}_q\}\) be the best of the two feasible clusterings provided by MDAV and VMDAV (that is, the one with smallest IL). Here, q denotes the number of clusters, and \(\mathcal{C}_i\) the set of points in cluster i.

    3. Apply the local search improvement algorithm (described in Sect. 3) to \(\mathcal{C}\), obtaining the updated clustering \(\mathcal{C'}=\{\mathcal{C'}_1, \dots , \mathcal{C'}_q\}\). The updated clustering \(\mathcal{C'}\) has the same number of clusters q as \(\mathcal{C}\), but the points in subsets \(\mathcal{C'}_i\) and \(\mathcal{C}_i\) can differ.

    4. Partition the microdata and distance matrices A and D into \(s'\le s\) subsets \(\mathcal{S}_i, i=1,\dots ,s'\), according to the clustering \(\mathcal{C'}\). For this purpose we compute \(\bar{p} = round(p/s)\), the minimum number of points in each subset of the partition, and build each subset \(\mathcal{S}_i\) by sequentially adding the points of clusters \(\mathcal{C'}_j, j=1,\dots ,q\) until the cardinality \(\bar{p}\) is reached (a sketch of this step is given after the list).

    5. Apply the mixed integer linear optimization method based on column generation in [4] to each microaggregation subproblem defined by \(A_{\mathcal{S}_i}\) and \(D_{\mathcal{S}_i}\), \(i=1,\dots ,s'\), obtaining a feasible clustering \(\mathcal{O}_i\) for the points in \(\mathcal{S}_i\), \(i=1,\dots ,s'\).

    6. Form the union of clusterings \(\mathcal{O}= \mathcal{O}_1 \cup \dots \cup \mathcal{O}_{s'}\); \(\mathcal{O}\) is a feasible clustering for the original microdata A.

    7. Finally, once again apply the local search improvement algorithm from Sect. 3 to \(\mathcal{O}\) in order to obtain the final microaggregation \(\mathcal{O'}\).

  • Return: Clustering \(\mathcal{O'}\).
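
The following Python sketch illustrates Step 4 as we read it from the description above; it is not the authors' C++ implementation. Whole clusters of \(\mathcal{C'}\) are assigned sequentially to the current subset until it reaches the minimum size \(\bar{p} = round(p/s)\); how the remaining points are handled is our assumption.

```python
import numpy as np

def partition_by_clusters(clusters, p, s):
    """Step 4: assign whole clusters of C' sequentially to subsets of at least
    round(p/s) points each, producing s' <= s subsets of point indices."""
    p_bar = int(round(p / s))                 # minimum subset cardinality
    subsets, current = [], []
    for cl in clusters:                       # clusters C'_1, ..., C'_q (lists of indices)
        current.extend(cl)
        if len(current) >= p_bar and len(subsets) < s - 1:
            subsets.append(current)
            current = []
    if current:                               # leftover points form the last subset
        subsets.append(current)
    return subsets                            # S_1, ..., S_{s'}

# Each subproblem in Step 5 then works on the submatrices A[S_i, :] and D[np.ix_(S_i, S_i)].
```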

Step 3 of the algorithm can be skipped, in which case the partition in Step 4 is obtained from the clustering \(\mathcal{C}\) of Step 2. However, we have observed that better results are generally obtained if the local search improvement heuristic is applied in both Steps 3 and 7, not only in Step 7. Indeed, if efficiency is a concern, the whole procedure can be stopped after Step 3, returning the clustering \(\mathcal{C'}\) as a solution, which in general significantly outperforms the one obtained in Step 2. In this way, the usually computationally expensive Step 5 is avoided.

Note also that the clustering \(\mathcal{C'}\) obtained in Step 3 is used only in Step 4 to decompose the microdata into subsets, not as a starting solution for the optimization procedure in Step 5; this is the case in the current implementation. Therefore, the clustering computed in Steps 5–6 by the optimization procedure might have a larger SSE than \(\mathcal{C'}\). On the other hand, by not starting the optimization procedure in Step 5 from \(\mathcal{C'}\), we have some chance of obtaining a different and possibly better local solution.

The larger the value of s, the faster the algorithm, since the minimum number of points \(\bar{p} = round(p/s)\) in each subset \(\mathcal{S}_i, i=1,\dots ,s'\) will be smaller, and therefore the optimization algorithm of [4] will be more efficient. However, the IL (SSE) of the final clustering \(\mathcal{O'}\) also increases with s. Therefore, the parameter s provides a trade-off between efficiency and solution quality.

In the next two sections, we outline the local search improvement heuristic used in Steps 3 and 7, as well as the mixed integer linear optimization method in Step 5.

Fig. 1. Two-swapping local search improvement heuristic

3 The Local Search Improvement Heuristic

Given a feasible clustering for the microaggregation problem, a local search algorithm tries to improve it by exploring alternative solutions in a local neighborhood of the current solution. The local search considered in this work is a two-swapping procedure; in addition to its simplicity, it has proven to be very effective in practice. Briefly, the two-swapping heuristic performs a series of iterations, and at each iteration it finds the pair of points (i, j) located in different clusters whose swap would most reduce the overall SSE. This operation is repeated until no improvement in SSE is detected. Since swapping two points preserves all cluster cardinalities, feasibility is maintained. The cost per iteration of the heuristic is \(O(p^2)\), since roughly \(p^2/2\) pairs of points are examined. Similar approaches have been used in other clustering techniques, such as in the partitioning around medoids algorithm for k-medoids [12]. The two-swapping algorithm implemented is shown in Fig. 1.
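
As a complement to the pseudocode in Fig. 1, the sketch below gives our reading of the two-swapping heuristic in plain Python/NumPy (the clustering is encoded as a NumPy array with one integer label per point). Recomputing the full SSE for every tentative swap is a simplification for clarity; an efficient implementation would update the SSE incrementally. All names are ours.

```python
import numpy as np
from itertools import combinations

def sse_of(A, labels):
    """SSE of a clustering encoded as one integer cluster label per point."""
    return sum(((A[labels == c] - A[labels == c].mean(axis=0)) ** 2).sum()
               for c in np.unique(labels))

def two_swap_local_search(A, labels):
    """Repeatedly apply the best SSE-reducing swap of two points lying in different
    clusters; stop when no swap improves the solution. Swaps keep cluster sizes,
    so feasibility (cardinality >= k) is preserved."""
    labels = labels.copy()
    best = sse_of(A, labels)
    improved = True
    while improved:
        improved, best_pair = False, None
        for i, j in combinations(range(len(A)), 2):
            if labels[i] == labels[j]:
                continue
            labels[i], labels[j] = labels[j], labels[i]    # tentative swap
            val = sse_of(A, labels)
            labels[i], labels[j] = labels[j], labels[i]    # undo
            if val < best - 1e-12:
                best, best_pair, improved = val, (i, j), True
        if best_pair is not None:
            i, j = best_pair
            labels[i], labels[j] = labels[j], labels[i]    # apply the best swap found
    return labels, best
```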

4 The Mixed Integer Linear Optimization Algorithm Based on Column Generation

In this section we quickly outline the optimization method presented in [4]. Additional details can be found in that reference.

The formulation of microaggregation as an optimization problem in [4] is based on the following property of the \(SSE_h\) of cluster \(\mathcal{C}_h=\{a_{h_i}, i=1,\ldots ,n_h\}\) (see [4, Prop. 3] for a proof):

$$\begin{aligned} \begin{aligned} SSE_h&= \displaystyle \sum _{i=1}^{n_h} (a_{h_i} - \overline{a_h})^{\top }(a_{h_i} -\overline{a_h}) \\&= \displaystyle \frac{1}{2 n_h} \sum _{i=1}^{n_h} \sum _{j=1}^{n_h} (a_{h_i} - a_{h_j})^{\top }(a_{h_i} - a_{h_j}) = \frac{1}{2 n_h} \sum _{i=1}^{n_h} \sum _{j=1}^{n_h}D_{h_i h_j}. \end{aligned} \end{aligned}$$
(4)

That is, for computing \(SSE_h\), we do not need the centroid of the cluster, but only the distances between the points in the cluster.
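
A quick numerical check of identity (4), using nothing beyond NumPy (the random cluster is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.normal(size=(7, 4))                  # a cluster of n_h = 7 random points in R^4

# Centroid-based form of SSE_h (first line of (4))
sse_centroid = ((C - C.mean(axis=0)) ** 2).sum()

# Pairwise-distance form of (4): (1 / (2 n_h)) * sum_{i,j} ||a_i - a_j||^2
D = ((C[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)   # squared Euclidean distances
sse_pairwise = D.sum() / (2 * len(C))

print(np.isclose(sse_centroid, sse_pairwise))             # True
```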

From (4), defining binary variables \(x_{ij}\), \(i,j=1,\dots ,p\) (which are 1 if points i and j are in the same cluster, 0 otherwise), the microaggregation problem can be formulated as:

$$\begin{aligned} \min \quad&\displaystyle SSE\triangleq \frac{1}{2} \sum _{i=1}^{{p}} \frac{\sum _{j=1,j\ne i}^{{p}} D_{ij} x_{ij}}{\sum _{j=1 , j\ne i}^{{p}} x_{ij} + 1} \end{aligned}$$
(5a)
$$\begin{aligned} \hbox {s. to}\quad&x_{ir} + x_{jr} - x_{ij} \le 1 \ \ i,j,r=1,\dots ,p, \ \ i\ne j, r \ne j, i\ne r \end{aligned}$$
(5b)
$$\begin{aligned}&\displaystyle \sum _{j=1, j\ne i}^p x_{ij} \ge k-1 \;\; i=1,\dots ,p \end{aligned}$$
(5c)
$$\begin{aligned}&x_{ij}= x_{ji}, x_{ij}\in \{0,1\}, i,j=1,\dots , p . \end{aligned}$$
(5d)

Constraints (5b) are triangular inequalities: if points i and r are in the same cluster and points r and j are in the same cluster, then points i and j must also be in the same cluster. Constraints (5c) guarantee that the cardinality of each cluster is at least k. The denominator in the objective function (5a) is the cardinality of the cluster that contains point i. Unfortunately, (5) is a difficult nonlinear and nonconvex integer optimization problem (see [4] for details).
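
Purely as an illustration of formulation (5) (it plays no role in the algorithm of [4]), the following sketch evaluates the objective (5a) and checks constraints (5b)–(5c) for a candidate matrix x; the helper names are ours.

```python
import numpy as np

def objective_5a(D, x):
    """Objective (5a). x is a p x p symmetric 0/1 matrix with zero diagonal,
    x[i, j] = 1 iff points i and j are in the same cluster (D also has zero diagonal)."""
    num = (D * x).sum(axis=1)      # sum_{j != i} D_ij x_ij
    den = x.sum(axis=1) + 1        # cardinality of the cluster containing point i
    return 0.5 * float((num / den).sum())

def satisfies_5b_5c(x, k):
    """Check the triangular inequalities (5b) and the cardinality constraints (5c)."""
    p = len(x)
    card_ok = bool((x.sum(axis=1) >= k - 1).all())
    tri_ok = all(x[i, r] + x[j, r] - x[i, j] <= 1
                 for i in range(p) for j in range(p) for r in range(p)
                 if i != j and j != r and i != r)
    return card_ok and tri_ok
```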

A more practical alternative is to consider a formulation inspired by the clique partitioning problem with minimum clique size of [9]. Defining as \(\mathcal{C}^*= \{ \mathcal{C}\subseteq \{1,\dots ,p\}: k \le |\mathcal{C}| \le 2k-1 \}\) the set of feasible clusters, the microaggregation problem can be formulated as:

$$\begin{aligned} \begin{array}{rll} \min \quad &{} \displaystyle \sum _{\mathcal{C} \in \mathcal {C}^*} w_\mathcal{C}x_\mathcal{C} \\ \hbox {s. to} \quad &{}\displaystyle \sum _{\mathcal{C} \in \mathcal {C}^*: i \in \mathcal{C}} x_\mathcal{C} = 1 \quad &{} i\in \{1,\dots ,p\} \\ &{} x_\mathcal{C} \in \{0,1\} &{} \mathcal{C} \in \mathcal {C}^*, \end{array} \end{aligned}$$
(6)

where \(x_\mathcal{C}=1\) means that feasible cluster \(\mathcal C\) appears in the microaggregation solution provided, and the constraints guarantee that each point is covered by exactly one feasible cluster.

From (4), the cost \(w_\mathcal{C}\) of cluster \(\mathcal{C}\) in the objective function of (6) is

$$\begin{aligned} w_\mathcal{C} = \frac{1}{2\,|\mathcal{C}|} \sum _{i \in \mathcal{C}} \sum _{j \in \mathcal{C}} D_{ij}. \end{aligned}$$
(7)

The number of feasible clusters in \(\mathcal{C}^*\)—that is, the number of variables in the optimization problem (6)—is \(\sum _{j=k}^{2k-1} \binom{p}{j}\), which can be huge. For instance, for \(p=1000\) and \(k=3\) we have \(|\mathcal{C}^*|= 8.29\cdot 10^{12}\). However, the linear relaxation of (6) can be solved using a column generation technique, where the master problem is defined as (6) but considers only a subset \(\bar{\mathcal{C}} \subseteq \mathcal{C}^*\) of the variables/clusters. The set \(\bar{\mathcal{C}}\) is updated at each iteration of the column generation algorithm with new clusters, which are computed by a pricing subproblem. The pricing subproblem either detects that the current set \(\bar{\mathcal{C}}\) contains the optimal set of columns/clusters or, otherwise, generates new candidate clusters with negative reduced costs. For small datasets and values of k, the pricing subproblem can be solved by complete enumeration; otherwise, an integer optimization model must be solved. The master problem requires the solution of a linear optimization problem. Both the linear and integer optimization problems were solved with CPLEX in this work. The solution of the linear relaxation of (6) provides a lower bound for the microaggregation problem (usually a tight one). In addition, solving the master problem as an integer problem provides a feasible solution to the microaggregation problem (usually of high quality). A thorough description of this procedure, and of the properties of the pricing subproblem, can be found in [4] and [18].
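
To make formulation (6)–(7) concrete, the sketch below builds and solves it by complete enumeration of \(\mathcal{C}^*\) for a tiny instance, using SciPy's MILP interface (scipy.optimize.milp, available in SciPy 1.9 or later). This brute-force stand-in is only an illustration under our own assumptions; it is not the column generation procedure of [4], which becomes necessary as soon as \(|\mathcal{C}^*|\) grows.

```python
from itertools import combinations
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def microaggregate_by_enumeration(D, k):
    """Solve (6)-(7) exactly by enumerating every feasible cluster with
    k <= |C| <= 2k - 1. Only viable for very small p; larger instances
    require the column generation approach of [4]."""
    p = len(D)
    feasible = [c for size in range(k, 2 * k)                        # sizes k, ..., 2k-1
                for c in combinations(range(p), size)]
    w = np.array([sum(D[i, j] for i in c for j in c) / (2 * len(c))  # cluster cost (7)
                  for c in feasible])
    cover = np.zeros((p, len(feasible)))                             # point-vs-cluster incidence
    for col, c in enumerate(feasible):
        cover[list(c), col] = 1.0
    res = milp(c=w,
               constraints=LinearConstraint(cover, lb=1, ub=1),      # each point in exactly one cluster
               integrality=np.ones(len(feasible)),
               bounds=Bounds(0, 1))
    chosen = [feasible[j] for j in np.flatnonzero(res.x > 0.5)]
    return chosen, res.fun                                           # optimal clusters and their SSE
```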

5 Computational Results

The algorithm in Sect. 2 and the local search heuristic in Sect. 3 have been implemented in C++. We used the code of [4] (also implemented in C++) for the solution of the mixed integer linear optimization approach based on column generation, as described in Sect. 4. A time limit of 3600 s was set for the solution of each subproblem with the column generation algorithm in Step 5 of the heuristic in Sect. 2. We tested the decomposition algorithm with the datasets Tarragona and Census, which are standard in the literature [3]. The results for Tarragona and Census are shown, respectively, in Tables 1–2 and 3–4. These tables show, for each instance and value \(k\in \{3,5,10\}\), the IL and CPU time for the main steps of the decomposition heuristic in Sect. 2. We also tried the values \(s\in \{40,20,10,5,2\}\) for partitioning the dataset (Step 4) prior to the optimization in Step 5. For Step 2, the tables also show which of the MDAV or VMDAV algorithms reported the best solution. The difference between Tables 1 and 2 (and likewise between Tables 3 and 4) is that the former include Step 3, while in the latter this step was skipped. We remind the reader that the decomposition algorithm could be stopped after Step 3 with a feasible and generally good solution. However, the best IL values for each k, which are marked in boldface in the tables, are obtained after Step 7, although this means going through the usually more expensive Step 5.

Table 1. Results for the Tarragona dataset considering Step 3 of the algorithm. The best IL for each k is marked in boldface.
Table 2. Results for the Tarragona dataset without Step 3 of the algorithm. The best IL for each k is marked in boldface.
Table 3. Results for the Census dataset considering Step 3 of the algorithm. The best IL for each k is marked in boldface.
Table 4. Results for the Census dataset without Step 3 of the algorithm. The best IL for each k is marked in boldface.

From Tables 1, 2, 3 and 4 we conclude that:

  • In general, the smaller k and the larger s, the faster the decomposition heuristic. In a few cases, however, smaller values of s also meant smaller CPU times; for instance, this is observed for Census, \(k=10\), and values \(s=20\) and \(s=10\). The explanation is that the maximum time limit was reached in some pricing subproblems in those runs, and therefore the CPU time increased with s.

  • In general, smaller ILs are obtained for smaller s values, as expected.

  • When \(k=10\), Step 5 is faster for \(s=2\) or \(s=5\), which was initially unexpected. The reason is that, when k is large and s is small, the column generation optimization algorithm generates new columns heuristically, reaching the maximum allowed space of 3000 columns; thus the solution of the difficult mixed integer linear pricing subproblems is never performed. However, in those cases, poorer values of IL were obtained.

  • In general, the best IL values are obtained when using Step 3, with the exception of Tarragona and \(k=10\), whose best IL was given in Table 4 without Step 3. This can be due to the randomness associated with partitioning the dataset into s subsets.

  • The solution times in Step 5 are generally longer when k is large and s is small.

Table 5. Comparison of best IL values obtained with the new approach vs. those found in the literature (citing the source).

Finally, Table 5 summarizes the best IL results obtained with the new approach, comparing them with—to our knowledge—the best values reported in the literature by previous heuristics (citing the source) and with the optimization method of [1]. The approaches implemented by those other heuristics were discussed in Sect. 1 of this paper. It can be seen that the new approach provided a better solution than previous heuristics in all cases. In addition, for \(k\in \{3,5\}\), the new approach also provided IL values close to the ones provided by the optimization method of [1], usually while requiring fewer computational resources. For instance, with our approach, the solutions for Tarragona with \(k=3\) and \(k=5\) required, respectively, 7 and 1127 s, whereas the optimization method of [1] (running on a different—likely older—computer) needed, respectively, 160 and 4779 s. For Census with \(k=3\) and \(k=5\), the solution times with our approach were, respectively, 10 and 5346 s, whereas that of [1] required, respectively, 3868 and 6788 s. The optimization method of [1] is unable to solve problems with \(k=10\), and in this case our new approach reported—as far as we know—the best IL values ever computed for these instances.

6 Conclusions

We have presented preliminary results from a new heuristic approach for the microaggregation problem. This method combines three ingredients: a partition of the dataset; the solution of each subset of the partition with an optimization-based approach; and a local search improvement heuristic. The results have shown that our new approach provides solutions that are as good as (and in some cases better than) those reported in the literature by other heuristics for the microaggregation problem, although it may generally require longer execution times. Future research will investigate improving Step 5 of the heuristic algorithm and will consider more sophisticated large-neighborhood search improvement heuristics.