1 Introduction

Privacy-preserving data publishing (PPDP) is an emerging area in which data is published to a third party while preserving the privacy of the individuals whose records it contains. Transactional data is real-world data generated by widely used applications such as retail stores and healthcare systems. Each record has the form {id: a1, a2, …, an}, where id denotes the identity of the user and a1, a2, …, an denote the set of attributes (items) belonging to that user.

The primary requirement in PPDP is protection against identity disclosure [1]. \(k^{m}\)-anonymity [1] is a model that protects against identity disclosure in transactional data with minimal information loss. It requires that every combination of up to m items that appears in the data appears in at least k transactions. Numerous methods are available to achieve \(k^{m}\)-anonymity. Disassociation [2] is a bucketization-based method that achieves \(k^{m}\)-anonymity while incurring less information loss. It has three phases: (i) horizontal partitioning, (ii) vertical partitioning, and (iii) refining.
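The anonymity condition can be checked directly on a small dataset. The following is a minimal sketch, assuming each transaction is given as a Python set of items; the function name `is_km_anonymous` and the toy data are illustrative and do not come from the paper. It enumerates item combinations exhaustively, so it is only practical for small m.

```python
from collections import Counter
from itertools import combinations


def is_km_anonymous(transactions, k, m):
    """Return True if every itemset of size <= m that occurs in the data
    occurs in at least k transactions."""
    for size in range(1, m + 1):
        support = Counter()
        for t in transactions:
            # Each combination is counted once per transaction.
            for combo in combinations(sorted(set(t)), size):
                support[combo] += 1
        if any(count < k for count in support.values()):
            return False
    return True


# Toy data (illustrative): item "c" and every pair containing it occur twice.
toy = [{"a", "b"}, {"a", "b"}, {"a", "b", "c"}, {"a", "b", "c"}]
print(is_km_anonymous(toy, k=2, m=2))
print(is_km_anonymous(toy, k=3, m=2))
```

On the toy data the check passes for k = 2 and fails for k = 3, because item "c" appears in only two transactions.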

In the first phase, similar transactions are grouped into one cluster. In the second phase, each cluster is converted into \(k^{m}\)-anonymous record chunks by placing infrequent item combinations in different record chunks. Any privacy-preserving data publishing method involves two steps: first, forming equivalence classes of similar records, and second, applying an anonymization method to the records in each equivalence class. Therefore, the more similar the records in a cluster are, the fewer modifications are needed to reach the desired privacy level, and the better data utility is maintained. Creating good equivalence classes/clusters is thus the main step of PPDP.

Ant colony optimization (ACO) is a technique inspired by the way ants from adjacent colonies search for food. When a food source is found, teams of ants follow diverse paths to it, leaving behind a pheromone trail, a chemical secreted by many animals that is of great importance for insects. The pheromone trail guides other ants, which follow the way laid down by the ants ahead of them. Some teams reach the food source before the others because they have taken the shortest path, and they follow the same path back to their colony, arriving before the other teams. The shortest path therefore carries pheromone from both the outward and the return trip sooner than any other path, so the probability that other teams choose it over the alternatives is much higher, unless a better path is discovered by another team. The pheromone trail on the shortest path is thus expected to be more concentrated than on the other paths.

The ant-based clustering algorithm is inspired by the clustering of corpses and the sorting of larvae observed in real ant colonies. Deneubourg et al. [3] began the study of this field, proposing a basic model in which objects are randomly moved, picked up, and dropped according to their similarity to the surrounding objects. The LF algorithm proposed by Lumer and Faieta [4] extends the basic model to numerical datasets. In this algorithm, ants are agents that move randomly on a two-dimensional grid. These agents pick up, transport, and drop the data items scattered within this environment. Picking and dropping depend on the similarity and density of the data items in the ant's neighborhood: items that are isolated or surrounded by dissimilar items are likely to be picked up, and ants tend to drop them near similar ones. This is how elements are clustered and sorted on the grid. Ant colony clustering algorithms are reported to be more flexible, robust, and decentralized than traditional methods [5,6,7].
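The pick-up and drop behaviour is usually expressed through two probability functions of the local similarity f perceived by an ant. The sketch below uses the form commonly attributed to the basic model, \(p_{\text{pick}} = (k_1/(k_1+f))^2\) and \(p_{\text{drop}} = (f/(k_2+f))^2\). The constants `k1` and `k2` and their default values are illustrative assumptions, not settings taken from this paper.

```python
def pick_probability(f, k1=0.1):
    """Probability that an unladen ant picks up an item.

    f is the perceived similarity/density around the item, in [0, 1];
    isolated or dissimilar items (small f) are picked up more often.
    """
    return (k1 / (k1 + f)) ** 2


def drop_probability(f, k2=0.15):
    """Probability that a laden ant drops its item at a location.

    Locations surrounded by similar items (large f) attract drops.
    """
    return (f / (k2 + f)) ** 2


for f in (0.0, 0.1, 0.5, 0.9):
    print(f, round(pick_probability(f), 3), round(drop_probability(f), 3))
```

As f grows, the pick-up probability falls and the drop probability rises, which is the mechanism that gathers similar items in the same region of the grid.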

This paper proposes the use of an ant colony clustering algorithm on transactional datasets to form optimized clusters of similar transactions.

The rest of the paper is structured as follows: Sect. 2 presents the related work in this domain. Section 3 proposes the application of an ant colony clustering algorithm for efficient clustering of transactional data. The results of the implementation of the proposed algorithm and its comparison with a related approach are discussed in Sect. 4. Lastly, Sect. 5 summarizes and concludes the work and outlines future work.

2 Related Work

The application of ACO to the clustering problem was introduced by Shelokar et al. [8]. Each element of a solution string represents one data sample, and its content denotes the cluster number to which that sample is allotted. Each ant in the ACO then builds a solution on the basis of this string representation. According to [9], ant algorithms for clustering can be divided into two groups: ant-based sorting and ACO-based clustering. Ant-based sorting uses a 2D grid on which the objects are first scattered randomly. Artificial ants then pick up objects that are dissimilar to their neighborhood and transfer them to a cluster containing similar objects. This approach was also used in [10,11,12,13]. Although ant-based sorting does not require the number of clusters to be defined in advance, its processing time is high because it requires post-processing to recognize the generated clusters [9]. This was observed in earlier studies in which the number of clusters had to be analyzed visually once clustering was complete [12]. ACO-based clustering is the other group; it uses the solution string described above to denote the clustering solution. The solution string is built in each iteration and assessed by the objective function to find the best one. Although the number of clusters must be defined in advance, ACO-based clustering is computationally more efficient than ant-based sorting, and it requires no post-processing once clustering is done [9]. Apart from ACO, some other clustering algorithms also use the same solution-string concept [14,15,16]. ACOC [17] is the first implementation of ACO-based clustering.

ACOC has since been enhanced in several studies. The study in [18] revised the original ACOC by keeping the best solution found as the initial solution for the next iteration and by adding the ability to determine the optimal number of clusters automatically using the Jaccard index; however, the study showed that the algorithm takes more time to run. The research in [17] adopted another methodology, combining ACOC with the k-means algorithm so that ACO explores the initial solution generated by k-means. However, that algorithm was tested only on financial services data. In [19], the ACO-based clustering concept was also used, but to build a classification model from a training dataset clustered using ACO. Fast ant colony optimization for clustering (FACOC) improves the computational efficiency of ACOC [20]. In FACOC, a threshold value determines whether a cluster number has become common for an object after being selected multiple times. If so, that cluster number is selected for that object in the next iteration without computing the probability. This reduces redundant computation and shortens execution time. In addition, local search does not affect objects with a common cluster number. However, the results indicate that FACOC produces clusterings of lower quality than ACOC.

3 Proposed Approach: Application of Ant Colony Clustering Algorithm for Efficient Clustering of Transactional Data

We propose an algorithm that creates efficient partitions using an ant colony clustering algorithm and then applies the VERPART algorithm [21] to achieve \(k^{m}\)-anonymity. The clusters are initialized using the HORPART algorithm [21]. HORPART selects the most frequent item “a” in the dataset and splits the dataset into two partitions: the records that contain “a” go into one partition and the remaining records into the other. Splitting continues in this way until all partitions are within a predefined size, yielding partitions P1, P2, …, Pn.
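The splitting step can be sketched as a short recursive function. This is an illustrative reading of the description above, not the reference implementation from [21]: the function name `horpart`, the `max_size` parameter, and the `ignore` set (which keeps an item from being chosen twice along the same branch) are assumptions made for the sketch.

```python
from collections import Counter


def horpart(transactions, max_size, ignore=frozenset()):
    """Recursively split on the most frequent item that has not yet been
    used on this branch, until each partition has at most max_size records."""
    if len(transactions) <= max_size:
        return [transactions]
    counts = Counter(
        item for t in transactions for item in t if item not in ignore
    )
    if not counts:
        # Only identical transactions remain; they cannot be split further.
        return [transactions]
    item, _ = counts.most_common(1)[0]
    with_item = [t for t in transactions if item in t]
    without_item = [t for t in transactions if item not in t]
    parts = horpart(with_item, max_size, ignore | {item})
    if without_item:
        parts += horpart(without_item, max_size, ignore)
    return parts


data = [
    frozenset(t)
    for t in (["a", "b"], ["a", "c"], ["a", "b", "d"], ["b", "c"], ["c", "d"], ["d"])
]
partitions = horpart(data, max_size=2)
print([len(p) for p in partitions])
```

Every transaction ends up in exactly one partition; the partition sizes depend on how ties between equally frequent items are broken.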

At the beginning, an ant m is associated with a partition PI. In each iteration, the ant selects the most dissimilar transaction di of partition PI, and another partition PJ is selected at random using a roulette wheel with probability pIJ, where pIJ depends on the pheromone trail and a local heuristic. The ant then assigns di to partition PJ. The pheromone trail is updated according to the rule

$$\tau _{{xy}} = (1 - \rho )\tau _{{xy}} + \Delta \tau _{{xy}}^{k}$$
(1)

where τxy is the amount of pheromone on the transition from partition x to partition y, ρ is the pheromone evaporation coefficient with \(\rho \in \left[ {0,~1} \right]\), and \(\Delta \tau _{{xy}}^{k}\) is the amount of pheromone deposited by the kth ant:

$$\Delta \tau _{{xy}}^{k} = \begin{cases} 1 & \text{if ant } k \text{ transfers a transaction from } x \text{ to } y \\ 0 & \text{otherwise} \end{cases}$$
(2)
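Equations (1) and (2) can be written as a single update for one ant. In the sketch below, the dictionary layout `tau[(x, y)]` and the function name `update_pheromone` are assumptions, as is the choice to evaporate every trail each time one ant moves; the text does not say how deposits from several ants are combined.

```python
def update_pheromone(tau, move, rho=0.8):
    """Apply Eq. (1) with the deposit of Eq. (2) for a single ant.

    tau  : dict mapping (x, y) partition pairs to pheromone levels
    move : the (x, y) pair the ant used, or None if it did not move
    rho  : evaporation coefficient in [0, 1]
    """
    return {
        edge: (1.0 - rho) * level + (1.0 if edge == move else 0.0)
        for edge, level in tau.items()
    }


tau0 = 0.001  # initial pheromone level
tau = {(0, 1): tau0, (1, 0): tau0}
tau = update_pheromone(tau, move=(0, 1), rho=0.8)
print(tau)
```

After one move from partition 0 to partition 1, the used trail holds roughly 1.0002 while the unused trail decays to roughly 0.0002.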

The local heuristic, or short-term visibility, is defined as the Jaccard similarity

$$\eta _{{IJ}} = \frac{{\left| {A \cap B} \right|}}{{\left| {A \cup B} \right|}}$$
(3)

where A and B are transactions, viewed as sets of items. The more similar transaction A is to transaction B of a particular partition, the larger the value, which increases the probability of assigning A to that partition. If ant k is at partition I, partition J is chosen with probability

$$p_{{IJ}}^{k} = \frac{{\left( {\tau _{{IJ}}^{\alpha } } \right)\left( {\eta _{{IJ}}^{\beta } } \right)}}{{\mathop \sum \nolimits_{{Z \in {\text{allowed}}_{I} }} \left( {\tau _{{IZ}}^{\alpha } } \right)\left( {\eta _{{IZ}}^{\beta } } \right)}}$$
(4)

where α and β weight the influence of the pheromone trail and of the local heuristic, respectively, and allowed_I is the set of partitions that can be selected from partition I.
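A minimal sketch of the selection step follows, assuming the pheromone and heuristic values are already available as dictionaries keyed by (source, target) partition pairs. The names `transition_probabilities` and `roulette_select`, the uniform fallback when all weights are zero, and the toy values are illustrative assumptions.

```python
import random


def transition_probabilities(source, candidates, tau, eta, alpha=0.8, beta=1.0):
    """Eq. (4): probability of moving from `source` to each candidate partition."""
    weights = {
        j: (tau[(source, j)] ** alpha) * (eta[(source, j)] ** beta)
        for j in candidates
    }
    total = sum(weights.values())
    if total == 0:
        # Assumption: choose uniformly when every weight is zero.
        return {j: 1.0 / len(weights) for j in weights}
    return {j: w / total for j, w in weights.items()}


def roulette_select(probabilities, rng=random):
    """Pick a key with probability equal to its value (roulette wheel)."""
    threshold = rng.random()
    cumulative = 0.0
    chosen = None
    for chosen, p in probabilities.items():
        cumulative += p
        if threshold < cumulative:
            break
    return chosen  # the last key also absorbs floating-point rounding


tau = {(0, 1): 0.001, (0, 2): 0.001, (0, 3): 0.001}
eta = {(0, 1): 0.5, (0, 2): 0.25, (0, 3): 0.25}
probs = transition_probabilities(0, [1, 2, 3], tau, eta)
target = roulette_select(probs, random.Random(0))
print(probs, target)
```

With equal pheromone on every trail, the probabilities are proportional to the heuristic values, so partition 1 is chosen about half of the time.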

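To determine whether the obtained solution is better than the previous one, the total Jaccard similarity B(P) over the n partitions is calculated, where JS(PI) denotes the Jaccard similarity of the transactions within partition PI: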

$$B\left( P \right) = \mathop \sum \limits_{{I = 1}}^{n} JS\left( {P_{I} } \right)$$
(5)
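The text does not spell out how JS(PI) is computed for a whole partition. The sketch below assumes it is the mean pairwise Jaccard similarity of the transactions in the partition and scores partitions with fewer than two transactions as 0; both choices, and the function names, are assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two transactions, as in Eq. (3)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def partition_similarity(partition):
    """Assumed JS(P_I): mean pairwise Jaccard similarity within a partition."""
    pairs = list(combinations(partition, 2))
    if not pairs:
        return 0.0  # assumption: fewer than two transactions scores 0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def objective(partitions):
    """Eq. (5): B(P) as the sum of JS(P_I) over all partitions."""
    return sum(partition_similarity(p) for p in partitions)


before = [[{"a", "b"}, {"a", "b"}, {"c"}], [{"c", "d"}]]
after = [[{"a", "b"}, {"a", "b"}], [{"c", "d"}, {"c"}]]
print(objective(before), objective(after))
```

Moving the dissimilar transaction {"c"} to the second partition raises B(P) in this toy case, so the new solution would be kept.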

Then, the dissimilar transaction d of partition I is assigned to partition J.

figure a

4 Implementation and Results

The proposed approach is implemented in Python and tested on the INFORMS dataset. In ACO, there are five parameters that need to be fixed. The literature [22] specifies the following conditions for selecting these values:

  • β has to be larger than α, so that the destination cluster is chosen based on the local heuristic, i.e., similarity with the destination cluster, rather than on the deposited pheromone;

  • α, β ≤ 1 is better than α, β > 1;

  • ρ = 0.8 is better than ρ = 0.7 and ρ = 0.9, as determined by a set of experiments.

Considering the above conditions, the following values are used:

  • M = number of clusters

  • tmax = 100

  • α = 0.8

  • β = 1

  • ρ = 0.8

  • τ0 = 0.001

The results are analyzed in terms of the information loss incurred while achieving privacy-preserving data publishing. To evaluate information loss, we use the relative error measure [21], which measures the loss in the association of items caused by anonymization; the results are shown in Fig. 1. The relative error is calculated for different values of k (= 4, 5, 6, 7) for data anonymized using the proposed algorithm and the Disassociation algorithm, respectively. The results show that data anonymized using the proposed algorithm has a lower relative error for each k than data anonymized using the Disassociation algorithm. A lower relative error indicates that more item associations are retained in the anonymized data and, thus, that data utility is preserved.
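The relative error for a pair of items can be sketched as follows. The formula used here, the absolute difference between the pair's support in the original and in the published data divided by the average of the two supports, is the form commonly used in the disassociation literature; the text above does not state it, so treat the formula, the function names, and the toy data as assumptions.

```python
def pair_support(transactions, a, b):
    """Number of transactions that contain both items a and b."""
    return sum(1 for t in transactions if a in t and b in t)


def relative_error(original, published, a, b):
    """Assumed form: |s_o - s_p| / avg(s_o, s_p) for the item pair (a, b)."""
    s_o = pair_support(original, a, b)
    s_p = pair_support(published, a, b)
    average = (s_o + s_p) / 2.0
    return abs(s_o - s_p) / average if average else 0.0


original = [{"a", "b"}, {"a", "b"}, {"a", "b", "c"}, {"a", "b", "c"}]
# Hypothetical published data in which two a-b associations were lost.
published = [{"a"}, {"b"}, {"a", "b", "c"}, {"a", "b", "c"}]
print(relative_error(original, published, "a", "b"))
```

A value of 0 means the association is fully preserved; under this form the value approaches 2 when an association present in one dataset is entirely absent from the other.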

Fig. 1 Relative error

5 Conclusion

The paper has shown that creating equivalence classes/clusters of similar records results in less information loss from the anonymization process, so data utility is maintained. The proposed algorithm uses ant colony optimization to further refine the equivalence classes/clusters; this improves the equivalence classes and thus reduces the information loss caused by anonymization. The proposed approach has been tested on the INFORMS dataset, where it gives a lower relative error than the Disassociation algorithm. In future, the applicability of other nature-inspired algorithms can be tested and compared.