
1 Introduction

Feature selection is widely used in data mining and machine learning tasks such as classification, clustering, and regression to improve readability and interpretability. Owing to this popularity, researchers have so far primarily analyzed feature selection in the supervised learning setting, especially classification, whereas in unsupervised learning [15], especially clustering [8], it has not been explored as extensively. A feature selection method generally consists of four main phases: selection, examination, terminating criterion, and validation. The first phase selects a feature subset using a predetermined search strategy, such as sequential search, sequential floating search, or complete search. The second phase examines the chosen feature subset by a specific criterion. The third phase applies the termination criterion and selects the best performing subset from all examined subsets. The last phase validates the chosen subset using validation metrics.

The remainder of this paper is organized as follows: Sect. 2 provides a literature review of existing work. The proposed feature selection method based on Ant Colony Optimization is introduced in Sect. 3. Section 4 presents experimental results on various benchmark datasets. Finally, Sect. 5 presents the conclusion and future work.

2 Literature Review

This section presents the various feature selection techniques proposed by other researchers. Dash and Liu [2] developed a hybrid feature selection approach that calculates entropy from the similarity of data and uses an entropy-based measure to evaluate the features in the filter stage. The wrapper stage uses a scatter separability criterion and k-means clustering to select the relevant feature subset. A drawback of this approach is its high computation cost. Later, Hruschka et al. [7] proposed a hybrid feature selection approach that uses a Bayesian filter with k-means clustering to identify the relevant feature subset; the filter is a Bayesian network that exploits the Markov blanket property. A drawback of this approach is that it was only tested on datasets with fewer than 30 features. Adopting the idea of Dash and Liu [2], Li et al. [9] proposed a new hybrid feature selection approach that uses the Fuzzy Feature Evaluation Index (FFEI) with an exponential entropy index to evaluate features in the filter stage to increase performance. They used a scatter separability criterion and the Fuzzy C-Means algorithm for the wrapper stage. This approach also suffers from high computation cost.

In 2015, another feature selection approach, suggested by Nahato et al. [10], used rough set theory to identify the relevant features. They used the rough indiscernibility relation to select the reducts and trained backpropagation neural networks on the selected reducts. This method was tested on the Statlog heart disease, Wisconsin breast cancer, and hepatitis datasets taken from the UCI machine learning repository [1] and achieved accuracies of 90.4%, 98.6%, and 97.3% with 6, 7, and 13 features, respectively. Later, in 2016, Solorio et al. [14] proposed a hybrid feature selection technique that uses the Laplacian score to rank the features and a modified Calinski-Harabasz index to evaluate a feature subset. They tested their approach on various benchmark datasets taken from the UCI machine learning repository [1] as well as on several synthetic datasets, achieving better results than the approaches proposed by Dash and Liu [2] and Li et al. [9].

In 1990, Dorigo et al. [3] introduced Ant Colony Optimization (ACO). This approach mimics the social behaviour of ants searching for food. Initially, it was developed to solve the famous traveling salesman problem; later it was applied to various complex optimization problems such as feature selection [11]. An unsupervised ACO-based feature selection method was proposed by Tabakhi et al. [17] that uses the cosine similarity measure to quantify the similarity between features. In this work, the number of artificial ants equals the number of attributes in the dataset, with each ant responsible for constructing a feature subset. The frequency of selected attributes across the different subsets was used to update the pheromone values. At every step, the feature with low similarity and high pheromone value was added to the feature subset, until the maximum number of iterations was reached. They tested their approach on several UCI machine learning datasets [1], such as the wine and breast cancer datasets, and obtained an average classification error of 19.8%. In 2017, Dhalia et al. [16] proposed another ACO-based approach that uses a tandem run strategy to select the relevant feature subset. They used the cosine similarity measure to quantify the similarity between features and a support vector machine (SVM) both for assigning fitness to a feature and for the final classification. They tested their approach on lung CT scan images to diagnose bronchitis and achieved 81.66% accuracy.

From the presented work, it can be inferred that hybrid methods perform well in comparison to pure filter and wrapper methods. ACO has also been used for feature selection in various tasks and gives increased accuracy. Therefore, the proposed work presents a hybrid feature selection approach based on ACO that uses the silhouette index and the Laplacian score as fitness measures and gives increased clustering performance on various benchmark datasets.

3 Proposed Method

In this work, we propose a novel hybrid feature selection technique based on ant colony optimization [4] (NHFS based ACO) that follows the tandem run strategy [5] to select the best feature subset.

The simulation model is expressed as a completely connected undirected graph G = (V, E) with a one-to-one mapping between vertices and features. Hence the number of vertices (\(v_{n}\)) equals the number of features (\(f_{n}\)). V denotes the set of vertices \(v_{1}, v_{2}, v_{3}, \ldots , v_{n}\) and E denotes the set of edges (\(e_{1}, e_{2}, e_{3}, \ldots , e_{n}\)) joining pairs of vertices in the graph. In this model, the number of artificial ants (\(N_{ant}\)) is taken equal to the number of features (\(f_{n}\)) to avoid being trapped in a local optimum, so \(f_{n}=v_{n}=N_{ant}\). In ACO, each artificial ant constructs a feature subset (\(F_{i}\)). N denotes the set of all feature subsets created by the ants and \(n_{max}\) indicates the maximum number of features allowed in each subset, so \(N_{ant} = |N|\) and \(0 \le n_{max} \le n\).

To build the feature subsets, each ant starts from a vertex and creates a feature subset by traversing different vertices. Every feature is associated with a pheromone value (\(\alpha \)), initially set to a constant. \(n_{tan}\) denotes the number of features selected by the tandem run strategy.

In this method, three steps are used to choose the \(n_{max}\) features. In the first step, n feature subsets are created by picking \(n_{max}\) features at random. K-means clustering is then applied to these subsets and their efficacy is evaluated in terms of silhouette index (SI) values; the subset with the highest SI value becomes the leader subset \((g_{bestset})\). Algorithm 1 describes the working of the first step. In the second step, n feature subsets are created differently, with the \(n_{max}\) features selected in three parts: some features \((n_{random})\) are chosen at random, others \((n_{arbitrary})\) are chosen for their high pheromone values and low Laplacian scores, and the remaining features \((n_{tan})\) are chosen from the leader subset, again favoring high pheromone values and low Laplacian scores. In the third step, these subsets are evaluated with K-means clustering in terms of their SI values. The subset with the highest SI is called the local best \((l_{bestset})\); if its SI exceeds that of the global best, it becomes the global best \((g_{bestset})\). The second and third steps are repeated for \(max_{iter}\) iterations, after which \(g_{bestset}\) is the subset with the highest SI value. The working of the second and third steps is shown in Algorithm 2, and a code sketch follows below.

Algorithm 1. Working of the first step (random subset construction and leader selection)
Algorithm 2. Working of the second and third steps (guided subset construction and global best update)
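
The following is a minimal Python sketch of the three steps above, not the authors' implementation. It assumes scikit-learn's KMeans and silhouette_score for the clustering and SI evaluation, scores candidate features by the product of pheromone and the heuristic \(h_{r}=1/L_{r}\) from Sect. 3.1 (the paper does not fix the exact selection rule), applies the pheromone update of Sect. 3.2 and the 30% splits of Sect. 4.3. All function and variable names (nhfs_aco, evaluate, n_arb, and so on) are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def nhfs_aco(X, lap_scores, n_max, n_clusters=2, max_iter=50, seed=0):
    """Sketch of NHFS based ACO (Algorithms 1 and 2); selection rule assumed."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]                       # features = vertices = ants
    pheromone = np.full(n, 1.0 / n)     # Eq. (5): uniform initial pheromone
    heuristic = 1.0 / lap_scores        # Eq. (4): h_r = 1 / L_r

    def evaluate(subset):
        # SI of the K-means clustering obtained on the candidate subset
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=seed).fit_predict(X[:, subset])
        return silhouette_score(X[:, subset], labels)

    # Step 1: n random subsets; the best one becomes the leader (g_bestset)
    subsets = [rng.choice(n, size=n_max, replace=False) for _ in range(n)]
    si_values = [evaluate(s) for s in subsets]
    g_best_si = max(si_values)
    g_bestset = subsets[int(np.argmax(si_values))]

    n_tan = int(np.ceil(0.3 * n_max))       # copied from the leader subset
    n_arb = int(np.ceil(0.3 * n_max))       # pheromone/heuristic guided
    n_rand = max(0, n_max - n_tan - n_arb)  # remainder picked at random

    for _ in range(max_iter):
        l_best_si, l_bestset = -np.inf, None
        for _ in range(n):                  # Step 2: one subset per ant
            score = pheromone * heuristic
            tan = rng.choice(g_bestset, size=min(n_tan, len(g_bestset)),
                             replace=False)
            rest = np.setdiff1d(np.arange(n), tan)
            arb = rest[np.argsort(score[rest])[-n_arb:]]  # best remaining
            rest = np.setdiff1d(rest, arb)
            rnd = rng.choice(rest, size=min(n_rand, len(rest)), replace=False)
            subset = np.concatenate([tan, arb, rnd])
            si = evaluate(subset)
            pheromone[subset] += si / n_max  # Eqs. (6) and (7)
            if si > l_best_si:
                l_best_si, l_bestset = si, subset
        if l_best_si > g_best_si:            # Step 3: update the global best
            g_best_si, g_bestset = l_best_si, l_bestset
    return g_bestset, g_best_si
```

For the datasets in Sect. 4, lap_scores would come from the Laplacian score computation in Sect. 3.1, and n_clusters would match the number of ground-truth classes (e.g., 2 for Ionosphere).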

3.1 Computation of Laplacian Score

Laplacian score [6] represents the locality preserving power of a feature and is used for feature ranking in many feature selection approaches; a good feature always has a low Laplacian score. For a given dataset with m instances, a similarity graph is constructed in the form of a weight matrix W of size \(m \times m\) such that \(W=\left\{ w_{11},w_{12},w_{13},\ldots ,w_{ij},\ldots ,w_{mm} \right\} \), where each edge connecting instances \(x_{i}\) and \(x_{j}\) carries a weight \(w_{ij}\) representing their similarity. The Laplacian matrix L is calculated as defined in Eq. (1).

$$\begin{aligned} L = D-W \end{aligned}$$
(1)

where W is the weight matrix and D is the diagonal degree matrix with \(D_{ii} = \sum _{j} w_{ij}\).

Let \(f_{r}\) be the \(r^{th}\) feature over all m instances, i.e., \(f_{r} = (f_{r1},f_{r2},f_{r3},\ldots ,f_{rm})^T\), where \(r \in \left[ 1,n \right] \). The Laplacian score of \(f_{r}\) is computed as defined in Eq. (2).

$$\begin{aligned} L_{r} = \frac{{\tilde{f_{r}}}^{T}L\tilde{f_{r}}}{{\tilde{f_{r}}}^{T}D\tilde{f_{r}}} \end{aligned}$$
(2)

\(\tilde{f_{r}}\) denotes the deviation of \(f_{r}\) from its weighted mean and is computed as given in Eq. (3).

$$\begin{aligned} \tilde{f_{r}} = f_{r} - \left( \frac{f_{r}^{T}D\boldsymbol{1}}{\boldsymbol{1}^{T}D\boldsymbol{1}}\right) \boldsymbol{1} \end{aligned}$$
(3)

where D is the diagonal degree matrix, \(\boldsymbol{1} = \left[ 1,\ldots ,1 \right] ^T\), and \(f_{r}^T\) is the transpose of \(f_{r}\).

After obtaining the Laplacian score, the heuristic value (\(h_{r}\)) is computed using Eq. (4).

$$\begin{aligned} h_{r} = \frac{1}{L_{r}} \end{aligned}$$
(4)
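
As a concrete illustration, the sketch below computes \(L_{r}\) for every feature per Eqs. (1)-(3) and returns the scores whose reciprocals give \(h_{r}\) in Eq. (4). The graph construction is an assumption: the paper does not specify it, so a k-nearest-neighbor graph with heat-kernel weights, a common choice following He et al. [6], is used here; the function name laplacian_scores and the parameters k and t are illustrative.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def laplacian_scores(X, k=5, t=1.0):
    """Laplacian score L_r per feature; k-NN heat-kernel graph is assumed."""
    m, n = X.shape
    dist = kneighbors_graph(X, k, mode='distance', include_self=False)
    W = dist.toarray()
    mask = W > 0
    W[mask] = np.exp(-W[mask] ** 2 / t)   # w_ij = exp(-||x_i - x_j||^2 / t)
    W = np.maximum(W, W.T)                # symmetrize the similarity graph
    D = np.diag(W.sum(axis=1))            # diagonal degree matrix
    L = D - W                             # Eq. (1)
    one = np.ones(m)
    scores = np.empty(n)
    for r in range(n):
        f = X[:, r]
        f_t = f - ((f @ D @ one) / (one @ D @ one)) * one       # Eq. (3)
        scores[r] = (f_t @ L @ f_t) / (f_t @ D @ f_t + 1e-12)   # Eq. (2)
    return scores   # heuristic of Eq. (4): h = 1.0 / scores
```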

3.2 Calculation of Pheromone

Step 1: Allocate initial pheromone (\(\alpha \)) to all features as given in Eq. (5).

$$\begin{aligned} \alpha _{f_{i}} = \frac{1}{n} \end{aligned}$$
(5)

where \(i \in \left[ 1,n \right] \).

Step 2: Each time a feature \(f_{i}\) is selected in a subset \(F_{j}\), where \(i \in \left[ 1,n \right] \) and \(j \in \left[ 1,n \right] \), its pheromone is updated as given in Eqs. (6) and (7).

$$\begin{aligned} fitness_{f_{i}} = \dfrac{SI(F_{j})}{n_{max}} \end{aligned}$$
(6)
$$\begin{aligned} \alpha _{f_{i}}(t+1) = \alpha _{f_{i}}(t) + fitness_{f_{i}} \end{aligned}$$
(7)

where \(\alpha _{f_{i}}(t)\) and \(\alpha _{f_{i}}(t+1)\) are the pheromone values of feature \(f_{i}\) at times t and t + 1, respectively.
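
As an illustrative worked example (the numbers are hypothetical): for a dataset with \(n = 4\) features, Eq. (5) gives \(\alpha _{f_{i}} = 1/4 = 0.25\) for every feature. If an ant's subset \(F_{j}\) with \(n_{max} = 2\) attains \(SI(F_{j}) = 0.5\), then by Eq. (6) \(fitness_{f_{i}} = 0.5/2 = 0.25\) for each selected feature, and by Eq. (7) its pheromone becomes \(\alpha _{f_{i}}(t+1) = 0.25 + 0.25 = 0.5\).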

4 Results and Discussion

4.1 Dataset Details

In the experimental study, we have used three benchmark datasets, namely Ionosphere, Sonar, and Vehicle silhouettes, collected from the UCI Machine Learning repository [1]. Preprocessing such as the removal of missing values is performed on these datasets. After preprocessing, the details of the datasets are as presented in Table 1.

Table 1. Dataset Details

4.2 Evaluation Measures

In this approach, two clustering measures and one visualizer are used to evaluate and visualize the clustering performance; they are defined as follows:

Jaccard Index (JI). JI [12] is an external evaluation measure for any clustering approach. It evaluates clustering performance based on its similarity to the ground truth or expert classification. Jaccard index values range from 0 to 1, where 0 represents no match between the clustering and the ground truth and 1 represents a perfect match.
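
In the standard pair-counting formulation (a common way to apply JI to clusterings; the paper does not spell it out), with TP the number of point pairs placed in the same cluster in both partitions, FP the pairs grouped together only by the clustering, and FN the pairs grouped together only in the ground truth:

$$\begin{aligned} JI = \frac{TP}{TP + FP + FN} \end{aligned}$$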

Silhouette Index (SI). SI [13] is an internal evaluation measure for any clustering algorithm. It is based on the similarity of a data point to its own cluster, known as cohesion, and to its nearest neighboring cluster, known as separation. It ranges from -1 to 1, and a high SI value represents well-clustered data points. The silhouette index is calculated by averaging the silhouette coefficients of all data points.
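
Concretely, following the standard definition [13], if \(a(i)\) is the mean distance from point i to the other points of its cluster (cohesion) and \(b(i)\) is the mean distance from i to the points of the nearest other cluster (separation), the silhouette coefficient of i is

$$\begin{aligned} s(i) = \frac{b(i) - a(i)}{\max \left( a(i), b(i) \right) } \end{aligned}$$

and SI is the average of \(s(i)\) over all m data points.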

Silhouette Visualizer. The silhouette visualizer shows which clusters are dense and which are not by displaying the silhouette coefficient of each sample on a per-cluster basis. It also shows which clusters reach the average SI value.

4.3 Parameter Settings for NHFS Based ACO

In this work, experiments are carried out with different values of \(n_{max}\) for all datasets, and for each dataset 50 independent runs were conducted, therefore \(max_{iter} = 50\). The parameter settings for the various variables are as follows:

\(n_{arbitrary} = 30\%\) of \(n_{max}\)

\(n_{tan} = 30\%\) of \(n_{max}\)

In case of a fractional value, its ceiling is taken.
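
For example (illustrative numbers), with \(n_{max} = 7\): \(n_{arbitrary} = n_{tan} = \lceil 0.3 \times 7 \rceil = \lceil 2.1 \rceil = 3\), which, assuming the remaining features are the randomly chosen ones, leaves \(n_{random} = 7 - 3 - 3 = 1\).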

4.4 Experimental Analysis

Experiments are performed on the datasets listed in Table 1, and the NHFS based ACO approach is compared with the hybrid feature selection approach developed by Solorio et al. [14], since they used a similar strategy to obtain the best feature subset. Results are measured in terms of the Jaccard index and the silhouette index.

Experimental Findings on Ionosphere Dataset: The NHFS based ACO is applied to the Ionosphere dataset for different values of \(n_{max}\), and the results are presented in Table 2.

Table 2. Results on Ionosphere Dataset

It can be seen from Table 2 that NHFS based ACO gives increased JI and SI values in comparison to the approach of Solorio et al. [14] when 1 feature is selected. Both approaches were also evaluated with 7 features; here NHFS based ACO selected more relevant features and thereby performed better than the approach of Solorio et al. [14]. The silhouette visualizer obtained from the approach of Solorio et al. [14], shown in Fig. 1(a), shows some data points with negative silhouette coefficients in the blue colored cluster, indicating that those data points are wrongly clustered. On the other hand, the silhouette visualizer obtained from the NHFS based ACO approach, shown in Fig. 1(b), shows that all data points have positive silhouette coefficient values, indicating better clustering.

Fig. 1. (a) Silhouette visualizer for Ionosphere using Solorio et al. [14]. (b) Silhouette visualizer for Ionosphere using NHFS based ACO

Experimental Findings on Sonar Dataset: The NHFS based ACO is applied to the Sonar dataset for different values of \(n_{max}\), and the combined results are shown in Table 3.

Table 3. Results on Sonar Dataset

It can be seen from Table 3 that NHFS based ACO gives increased JI and SI values in comparison to the approach of Solorio et al. [14] when 1 feature is selected. The silhouette visualizers obtained from Solorio et al. [14] and NHFS based ACO are shown in Fig. 2(a) and Fig. 2(b) to visualize the clustering results.

Fig. 2. (a) Silhouette visualizer for Sonar using Solorio et al. [14]. (b) Silhouette visualizer for Sonar using NHFS based ACO

Experimental Findings on Vehicle Silhouettes Dataset: The NHFS based ACO is applied to the Vehicle silhouettes dataset for different values of \(n_{max}\), and the combined results are presented in Table 4.

Table 4. Results on Vehicle silhouettes Dataset

It can be seen from Table 4 that the NHFS based ACO approach gives increased JI and SI values in comparison to the approach of Solorio et al. [14] in all cases, with the highest SI value obtained when 4 features are selected. Both approaches were also evaluated with 5 features; here NHFS based ACO selected more relevant features and thereby performed better than the approach of Solorio et al. [14]. The silhouette visualizers obtained from Solorio et al. [14] and the NHFS based ACO approach with 4 features are presented in Fig. 3(a) and Fig. 3(b) to observe the clustering results.

Fig. 3. (a) Silhouette visualizer for Vehicle silhouettes using Solorio et al. [14]. (b) Silhouette visualizer for Vehicle silhouettes using NHFS based ACO

4.5 Comparison

The comparison graphs in Fig. 4 compare Solorio et al. [14] and NHFS based ACO in terms of the silhouette index and show that NHFS based ACO gives better SI values on all datasets. Likewise, Fig. 5 compares Solorio et al. [14] and NHFS based ACO in terms of the Jaccard index and shows that NHFS based ACO gives better JI values on all datasets.

Fig. 4. Comparison between Solorio et al. [14] and NHFS based ACO in SI

Fig. 5. Comparison between Solorio et al. [14] and NHFS based ACO in JI

5 Conclusion

In the proposed method, a hybrid feature selection algorithm based on Ant Colony Optimization, named NHFS based ACO, is presented. It removes redundant and irrelevant features that have a negative impact on model building and selects the more appropriate features from data with a large number of features. The NHFS based ACO approach uses a combination of the Laplacian score and the silhouette index to measure the relevancy of a feature, rather than using the Laplacian score in the filter stage and the silhouette index in the wrapper stage separately. It also uses the tandem run strategy to select the most promising features from the leader subset, which improves the power of the proposed approach. The proposed approach is tested on 3 benchmark datasets with large numbers of features and achieves better results than another state-of-the-art approach. The proposed method also works well on the Ionosphere dataset, clustering the data points in such a way that all of them have a positive silhouette coefficient. The work on feature selection can be expanded by considering other bio-inspired feature selection algorithms as well as by using various other filter measures.