
1 Introduction

Data science is vital in numerous industries, enabling informed decision-making through in-depth data analysis and interpretation. Through the application of techniques such as machine learning, businesses can use predictive analytics to anticipate future outcomes and meet customer needs [1]. Clustering algorithms automatically reveal patterns and relationships in data: they identify similarities and group data points according to their characteristics and the chosen clustering technique [2]. Clustering is an unsupervised learning method that uncovers natural groups in a dataset, facilitating data exploration and comprehension [3]. However, clustering faces challenges in selecting suitable data representatives, handling diverse data types, and dealing with complex distributions. It is a computationally hard task in the class of NP-complete problems, aiming to minimize a dissimilarity measure in order to identify clusters in varied datasets [4]. Two fundamental approaches to data clustering are hierarchical clustering, which builds a tree-like division of the data, and partition clustering. In partition clustering, the objective is to determine cluster centers (centroids) and improve the partitioning through iterative relocation; K-means is a representative partition clustering algorithm [5].

Several nature-inspired optimization algorithms have gained attention as search-based clustering approaches. These algorithms optimize an objective function based on the sum of intra-cluster distances in order to find centroids. Examples from the literature include Gray Wolf Optimization (GWO) [6], the Jaya Algorithm (JAYA) [7], the Chaotic League Championship Algorithm (KSCLCA) [8], the Salp Swarm Algorithm (SSA) [9], the Dandelion Optimizer (DO) [10], the Leader Slime Mould Algorithm (LSMA) [11], the Flow Direction Algorithm (FDA) [12], the Artificial Gorilla Troops Optimizer (GTO) [13], the Mountain Gazelle Optimizer (MGO) [14], the Prairie Dog Optimization Algorithm (PDO1) [15], the Chimp Optimization Algorithm (CHIMP) [16], and the Opposition African Vultures Optimization Algorithm (OAVOA) [17].

The Mountain Gazelle Optimization (MGO) algorithm, proposed in [14], mimics the social behaviors of mountain gazelles and models male herds, maternity herds, territorial males, and migration for food exploration. While MGO excels on benchmark functions and engineering problems, its application to data clustering remains challenging. In particular, the Migration to Search for Food strategy in MGO is unsuitable for clustering problems. We therefore propose a variant, the Chaotic Mountain Gazelle Optimizer (CMGO), which integrates a chaotic map into the Territorial Solitary Males strategy and excludes the Migration to Search for Food strategy. In data clustering, K-means is commonly used, but it encounters difficulties when computing distances between objects and centroids, especially for categorical and binary data. To address this, the Gower distance technique is integrated into K-means clustering; according to [18], using Gower's similarity coefficients improved the accuracy of the K-means algorithm in experiments with various datasets. We selected 28 real datasets from the UCI and OpenML repositories to assess the performance of the proposed algorithm on three data types: numerical, categorical, and mixed. Its effectiveness was compared against 14 state-of-the-art approaches. The evaluation employed the F-Measure metric, and statistical significance ranking was conducted using the tied rank test. The results reveal that CMGO exhibits lower intra-cluster distances and higher F-Measure values, outperforming both the original MGO algorithm and the other tested algorithms. Specifically, CMGO secured the first position in clustering numeric and categorical data, and ranked third for mixed data.

The remainder of the paper is organized as follows: Sect. 2 provides an overview of K-means clustering and the traditional MGO algorithm. Section 3 introduces the proposed method, CMGO. Section 4 presents the performance evaluation experiments. Section 5 presents the discussion. Finally, Sect. 6 concludes the paper.

2 Related Works

In this section, we discuss the K-means algorithm for data clustering and introduce the traditional Mountain Gazelle Optimization (MGO) algorithm whose capabilities we later enhance.

2.1 K-means Clustering

The K-means clustering algorithm has received significant attention in the literature; in nature-inspired optimization approaches in particular, many researchers employ optimization algorithms to search for cluster centers. These algorithms aim to discover cluster centers by minimizing an objective function based on the sum of intra-cluster distances. K-means partitions the dataset into \(K\) distinct clusters through unsupervised learning. Given the data points \(X=[{x}_{1}, {x}_{2}, {x}_{3}, \dots, {x}_{N}]\) and the \(K\) clusters \(C=\left\{{c}_{1},{c}_{2},{c}_{3}, \dots ,{c}_{K}\right|\forall i=1, \dots , K: {c}_{i} \ne \varnothing \,\,and\,\, \forall i \ne j: {c}_{i} \cap {c}_{j}= \varnothing \}\), each data point in \(X\) is assigned to one of the \(K\) clusters so as to minimize the objective fitness function. The sum of squared Euclidean distances between the data points and the center \({c}_{k}\) of their cluster is used as the objective function, as presented in Eq. (1).

$$f\left(k\right)= \sum_{k=1}^{K}\sum_{i=1}^{{N}_{k}}{({x}_{i}- {c}_{k})}^{2},$$
(1)

where \(k=1,2,\dots,K\) indexes the clusters, \({x}_{i}\), \(i=1,2,\dots,{N}_{k}\), are the patterns in the \({k}^{th}\) cluster, and \({c}_{k}\) is the center of the \({k}^{th}\) cluster. The cluster centers are computed as:

$${c}_{k}=\frac{1}{{N}_{k}} \sum_{i=1}^{{N}_{k}}{x}_{i}.$$
(2)

In this research, nature-inspired algorithms are employed to identify cluster centers within the dataset. The primary objective of the K-means algorithm is to determine an optimal center for each of the \(K\) clusters in the partition.
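A nature-inspired optimizer treats Eq. (1) as the fitness function it minimizes. As a minimal Python sketch (the function names are ours and purely illustrative), the assignment step, the objective of Eq. (1), and the centroid update of Eq. (2) can be written as:

```python
import numpy as np

def assign(X, centers):
    """Assign each data point to its nearest cluster center."""
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def intra_cluster_sse(X, centers, labels):
    """Objective of Eq. (1): sum of squared Euclidean distances
    between each point and the center of its assigned cluster."""
    return sum(np.sum((X[labels == k] - c) ** 2) for k, c in enumerate(centers))

def update_centers(X, labels, K):
    """Centroid update of Eq. (2): each center is the mean of its
    members (assumes every cluster is non-empty)."""
    return np.array([X[labels == k].mean(axis=0) for k in range(K)])
```

An optimizer searching for centers would call `intra_cluster_sse` as its fitness function, with `assign` deriving the labels from candidate centers.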

2.2 Traditional Mountain Gazelle Optimization (MGO)

This section provides a brief explanation of the main inspiration behind the traditional MGO algorithm [14], followed by a description of the mathematical model.

Territorial Solitary Males

Male mountain gazelles establish solitary territories through intense territorial battles and competition for females, with adult males vigorously defending their boundaries.

$$TSM={male}_{gazelle}-\left|\left({ri}_{1}\times BH- {ri}_{2}\times X\left(t\right)\right)\times F\right| \times {Cof}_{r}$$
(3)

The \({male}_{gazelle}\) is the position vector of the best global solution, representing an adult male. \({ri}_{1}\) and \({ri}_{2}\) are random integers, either 1 or 2. The coefficient vector \(BH\) represents the young male herd (Eq. (4)). \({Cof}_{r}\) is a coefficient vector, randomly selected in each iteration (Eq. (6)).

$$BH = X_{ra} \times \left\lceil {r_1 } \right\rceil + M_{pr} \times \left\lceil {r_2 } \right\rceil , \quad ra = \left( {\left\lceil \frac{N}{3} \right\rceil \ldots N} \right)$$
(4)

where \({X}_{ra}\) is the position of a young male gazelle chosen at random within the interval \(ra\). \({M}_{pr}\) is the average position of \(\lceil N/3 \rceil\) search agents selected at random from the population, where \(N\) is the total number of gazelles. \({r}_{1}\) and \({r}_{2}\) are random values in (0, 1].

$$F= {N}_{1}\left(D\right)\times exp\left(2-Iter \times \left(\frac{2}{MaxIter}\right)\right)$$
(5)

\({N}_{1}\) is a randomly generated number from the standard normal distribution, \(exp\) is the exponential function, \(MaxIter\) is the total number of iterations, and \(Iter\) is the current iteration.

$${Cof}_{i}=\left\{\begin{array}{c}\left(a+1\right)+{r}_{3},\\ a\times {N}_{2}\left(D\right),\\ {r}_{4}\left(D\right),\\ {N}_{3}\left(D\right)\times {N}_{4}{\left(D\right)}^{2}\times \mathrm{cos}\left(\left({r}_{4}\times 2\right)\times {N}_{3}\left(D\right)\right),\end{array}\right.$$
(6)

\({r}_{3}\) and \({r}_{4}\) are random numbers in (0, 1). \({N}_{2}\), \({N}_{3}\), and \({N}_{4}\) are randomly generated numbers from a normal distribution, and \(cos\) denotes the cosine function. One of the four cases is selected at random in each iteration to form \({Cof}_{r}\).

$$a= -1+Iter \times \left(\frac{-1}{MaxIter}\right)$$
(7)

\(MaxIter\) is the total number of iterations, while \(Iter\) is the current iteration count.

Maternity Herds

Maternity herds facilitate robust male gazelle births, with active male participation in delivery and young males competing for dominance over females.

$$MH=\left(BH+ {Cof}_{1,r}\right)+({ri}_{3} \times { male}_{gazelle}- {ri}_{4} \times { X}_{rand} ) \times {Cof}_{r}$$
(8)

\({ri}_{3}\) and \({ri}_{4}\) are random integers, either 1 or 2. The \({male}_{gazelle}\) is the best global solution in the current iteration. \({X}_{rand}\) is the position of a gazelle randomly selected from the entire population.

Bachelor Male Herds

Male gazelles establish territories and engage in intense battles for female possession, demonstrating dominance and control. This behavior is computed as follows:

$$BMH=\left(X\left(t\right)-D\right)+\left({ri}_{5}\times { male}_{gazelle} - {ri}_{6}\times BH\right)\times {Cof}_{r},$$
(9)

where \(X\left(t\right)\) is the position vector of the gazelle in the current iteration. \({ri}_{5}\) and \({ri}_{6}\) are randomly selected integers, either 1 or 2.

$$D=\left(\left|X\left(t\right)\right|+\left|{ male}_{gazelle} \right|\right)\times \left(2 \times {r}_{6}-1\right),$$
(10)

where \({r}_{6}\) is a random value between 0 and 1.

Migration to Search for Food

The mathematical formulation representing the foraging and migratory behavior of mountain gazelles incorporates their ability to cover long distances and engage in migration, as well as their exceptional running speed and jumping abilities.

$$MSF=\left(ub-lb\right)\times {r}_{7}+lb$$
(11)

\(ub\) and \(lb\) are the upper and lower bounds, respectively, and \({r}_{7}\) is a random value in the range (0, 1).

The four mechanisms (TSM, MH, BMH, and MSF) are applied to all gazelles to generate new candidate solutions, which are added to the population. High-quality gazelles are preserved, while weak or old ones are removed; the adult male gazelle is the best among them.
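To make the interplay of the four mechanisms concrete, the following Python sketch shows one possible implementation of a single MGO generation under our reading of Eqs. (3)-(11). The sampling of \({Cof}_{r}\) and the survivor selection are simplified assumptions, not the reference implementation.

```python
import numpy as np

def cof_vector(a, D, rng):
    """One of the four coefficient vectors of Eq. (6), picked at random."""
    cases = [
        (a + 1) + rng.random() * np.ones(D),
        a * rng.standard_normal(D),
        rng.random(D),
        rng.standard_normal(D) * rng.standard_normal(D) ** 2
        * np.cos((rng.random() * 2) * rng.standard_normal(D)),
    ]
    return cases[rng.integers(4)]

def mgo_generation(X, fitness, best, it, max_it, lb, ub, rng):
    """One MGO generation: TSM, MH, BMH, MSF offspring plus survivor selection."""
    N, D = X.shape
    a = -1 + it * (-1.0 / max_it)                                 # Eq. (7)
    F = rng.standard_normal(D) * np.exp(2 - it * (2.0 / max_it))  # Eq. (5)
    offspring = []
    for i in range(N):
        ri = rng.integers(1, 3, size=6)                 # random 1-or-2 integers
        Mpr = X[rng.choice(N, int(np.ceil(N / 3)), replace=False)].mean(axis=0)
        BH = X[rng.integers(N // 3, N)] + Mpr           # Eq. (4); ceil(r) = 1 for r in (0, 1]
        Cof = cof_vector(a, D, rng)
        tsm = best - np.abs((ri[0] * BH - ri[1] * X[i]) * F) * Cof          # Eq. (3)
        mh = (BH + cof_vector(a, D, rng)) \
             + (ri[2] * best - ri[3] * X[rng.integers(N)]) * Cof            # Eq. (8)
        Dv = (np.abs(X[i]) + np.abs(best)) * (2 * rng.random() - 1)         # Eq. (10)
        bmh = (X[i] - Dv) + (ri[4] * best - ri[5] * BH) * Cof               # Eq. (9)
        msf = (ub - lb) * rng.random(D) + lb                                # Eq. (11)
        offspring += [tsm, mh, bmh, msf]
    # keep the N fittest gazelles among parents and offspring
    pool = np.vstack([X, *offspring])
    order = np.argsort([fitness(p) for p in pool])
    return pool[order[:N]]
```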

3 Proposed Method: Chaotic Mountain Gazelle Optimizer (CMGO)

3.1 Motivation

The Mountain Gazelle Optimizer (MGO) algorithm draws inspiration from the social structure of wild mountain gazelles. While MGO demonstrates strong search capabilities in benchmark functions and engineering problems [14], its application to NP-complete real-world problems like data clustering remains challenging. To address this, we enhance MGO by incorporating a chaotic map into the Territorial Solitary Males strategy and excluding the Migration to Search for Food strategy. Additionally, we introduce the Gower distance technique to overcome challenges in computing distances for categorical and binary data in K-means clustering.

Chaotic Territorial Solitary Males Strategy

In our proposed CMGO algorithm, the Territorial Solitary Males strategy is enhanced by incorporating a chaotic map. The updated mathematical expression for the territory of adult male \(TSM{C}^{t+1}\) is given by the following equation.

$$TSM{C}^{t+1}={male}_{gazelle}-\left|\left(\left(\frac{{C}^{t+1}}{{ri}_{1}}\right)\times BH- {ri}_{2}\times X\left(t\right)\right)\times F\right| \times {Cof}_{r} ,$$
(12)

\({male}_{gazelle}\) is the position vector of the best global solution. \({ri}_{1}\) and \({ri}_{2}\) are random integers, either 1 or 2. \({C}^{t+1}\) is the value of the chaotic map at iteration \(t+1\) (see below). The coefficient vector \(BH\) corresponds to the young male herd from the original MGO algorithm, and \(F\) and \({Cof}_{r}\) are the same as in the original MGO.

The Chaotic Parameter

The parameters \({ri}_{1}\) and \({ri}_{2}\) serve as controls for updating the territory of the adult male \(TSM{C}^{t+1}\) in our CMGO algorithm. The parameter \({ri}_{1}\) is a random integer, taking a value of either 1 or 2, and directly influences the search solution. If \({ri}_{1}\) is 1, the coefficient vector \(BH\) remains unchanged. However, when a chaotic map is incorporated into the computation involving \({ri}_{1}\), the coefficient vector \(BH\) changes throughout the entire evolution process. Previous studies have demonstrated the seamless and effective integration of a chaotic map with the biogeography-based optimization (BBO) algorithm [19]. In our proposed CMGO algorithm, we use the Piecewise map within the Territorial Solitary Males strategy.

The iterative form of the Piecewise map is defined as:

$$x_{k+1}=\left\{\begin{array}{ll}\frac{x_{k}}{P}, & 0\le x_{k}<P\\ \frac{x_{k}-P}{0.5-P}, & P\le x_{k}<0.5\\ \frac{1-P-x_{k}}{0.5-P}, & 0.5\le x_{k}<1-P\\ \frac{1-x_{k}}{P}, & 1-P\le x_{k}<1\end{array}\right.$$
(13)

where the parameter \(P\) is set to 0.4. The visualization of the Piecewise map is depicted in Fig. 1.

Fig. 1. The behavior of the Piecewise map employed in our CMGO algorithm.
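A minimal Python sketch of the Piecewise map of Eq. (13) with \(P=0.4\); the seed value 0.7 is an arbitrary illustrative choice.

```python
def piecewise_map(x, P=0.4):
    """One iteration of the Piecewise chaotic map, Eq. (13)."""
    if x < P:
        return x / P
    if x < 0.5:
        return (x - P) / (0.5 - P)
    if x < 1 - P:
        return (1 - P - x) / (0.5 - P)
    return (1 - x) / P

# generate a short chaotic sequence from an arbitrary seed in (0, 1)
x, seq = 0.7, []
for _ in range(8):
    x = piecewise_map(x)
    seq.append(round(x, 4))
print(seq)  # a non-repeating, bounded sequence in (0, 1)
```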

The Gower Similarity Coefficient

To improve the performance of K-means clustering on categorical and binary data, a similarity-based measure, the Gower coefficient [20] (the Gower distance technique), is used in place of the squared Euclidean distance. The resulting dissimilarity \({D}_{Gow}\left({X}_{n},{C}_{j}\right)\) between a data point and a cluster center is computed during clustering as follows:

$${D}_{Gow}\left({X}_{n},{C}_{j}\right)= \frac{\sum_{k=1}^{{N}_{k}}{S}_{njk}{\delta }_{njk}{w}_{k}}{\sum_{k=1}^{{N}_{k}}{\delta }_{njk}{w}_{k}}$$
(14)

For binary and categorical attributes, \({S}_{njk}=0\) if \({X}_{nk}= {C}_{jk}\) and \({S}_{njk}=1\) otherwise. For continuous attributes, \({S}_{njk}=\left|{X}_{nk}- {C}_{jk}\right|/ ({max}_{l}{X}_{lk}-{min}_{l}{X}_{lk})\), where \(l\) runs over all non-missing values of attribute \(k\). If \({X}_{n}\) and \({C}_{j}\) can be compared on attribute \(k\), then \(\delta_{njk} = 1\), and zero otherwise. \({w}_{k}\) is the weight of attribute \(k\); for simplicity, we set \({w}_{k}=1\). \({N}_{k}\) is the total number of attributes.
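The following Python sketch computes the Gower dissimilarity of Eq. (14) under the conventions above (mismatch contributes 1, match 0, and \(w_k = 1\) by default); the function and argument names are illustrative.

```python
def gower_distance(x, c, is_numeric, ranges, w=None):
    """Gower dissimilarity D_Gow(x, c) of Eq. (14).

    x, c:        attribute vectors of a data point and a cluster center
    is_numeric:  boolean flags marking continuous attributes
    ranges:      max_l X_lk - min_l X_lk for each continuous attribute
    w:           per-attribute weights (w_k = 1 by default)
    Missing values (None) make an attribute non-comparable (delta_njk = 0).
    """
    n = len(x)
    w = [1.0] * n if w is None else w
    num = den = 0.0
    for k in range(n):
        if x[k] is None or c[k] is None:       # delta_njk = 0
            continue
        if is_numeric[k]:
            s = abs(x[k] - c[k]) / ranges[k] if ranges[k] > 0 else 0.0
        else:
            s = 0.0 if x[k] == c[k] else 1.0   # match -> 0, mismatch -> 1
        num += s * w[k]
        den += w[k]
    return num / den if den > 0 else 0.0

# example: one numeric attribute (range 10) and one categorical attribute
print(gower_distance([3.0, "red"], [8.0, "blue"], [True, False], [10.0, None]))
# (0.5 + 1.0) / 2 = 0.75
```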

3.2 The Main Process of Proposed CMGO Algorithm

The Chaotic Mountain Gazelle Optimizer (CMGO) algorithm is developed to tackle the challenges of clustering various data types. This enhanced version of the Mountain Gazelle Optimizer (MGO) is specifically designed for K-means clustering solutions.

The relationship between the CMGO optimizer and K-means clustering is as follows. We use CMGO to optimize the cluster centers of K-means clustering. First, the cluster centers are initialized at random positions, and the K-means procedure is started from each of these positions. Second, during the evolutionary process, CMGO iteratively updates the positions of the candidate cluster centers until it reaches the best cluster centers found. Finally, all data points are assigned to the resulting cluster centers, yielding the clustering output. The pseudo-code of the CMGO algorithm is shown in Algorithm 1.

Algorithm 1. Pseudo-code of K-means clustering based on CMGO.
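Algorithm 1 appears as an image in the original; the sketch below is our hedged Python rendering of its main loop: each gazelle encodes \(K\) candidate centers, the fitness is the intra-cluster distance of Eq. (1) (the Gower distance of Eq. (14) would replace it for categorical or mixed data), the chaotic TSM update follows Eq. (12), and the MSF strategy is excluded. The coefficient sampling is simplified and all names are ours.

```python
import numpy as np

def piecewise_map(x, P=0.4):
    """Piecewise chaotic map of Eq. (13)."""
    if x < P:
        return x / P
    if x < 0.5:
        return (x - P) / (0.5 - P)
    if x < 1 - P:
        return (1 - P - x) / (0.5 - P)
    return (1 - x) / P

def cmgo_kmeans(X, K, pop_size=30, max_it=100, seed=0):
    """K-means clustering driven by CMGO (a sketch of Algorithm 1)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    lb, ub = np.tile(X.min(axis=0), K), np.tile(X.max(axis=0), K)

    def fitness(flat):
        # sum of intra-cluster squared distances, Eq. (1)
        c = flat.reshape(K, D)
        return ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2).min(axis=1).sum()

    pop = rng.uniform(lb, ub, size=(pop_size, K * D))  # random initial centers
    chaos = 0.7                                        # chaotic seed
    for it in range(1, max_it + 1):
        chaos = piecewise_map(chaos)                   # C^{t+1}, Eq. (13)
        best = min(pop, key=fitness)
        F = rng.standard_normal(K * D) * np.exp(2 - it * (2.0 / max_it))  # Eq. (5)
        a = -1 + it * (-1.0 / max_it)                                     # Eq. (7)
        children = []
        for xg in pop:
            ri = rng.integers(1, 3, size=4)            # random 1-or-2 integers
            BH = pop[rng.integers(pop_size // 3, pop_size)]
            Cof = (a + 1) + rng.random()               # first case of Eq. (6)
            # chaotic Territorial Solitary Males, Eq. (12)
            tsmc = best - np.abs(((chaos / ri[0]) * BH - ri[1] * xg) * F) * Cof
            # Maternity Herds, Eq. (8)
            mh = (BH + Cof) + (ri[2] * best - ri[3] * pop[rng.integers(pop_size)]) * Cof
            # Bachelor Male Herds, Eqs. (9)-(10); MSF is excluded in CMGO
            Dv = (np.abs(xg) + np.abs(best)) * (2 * rng.random() - 1)
            bmh = (xg - Dv) + (ri[0] * best - ri[1] * BH) * Cof
            children += [tsmc, mh, bmh]
        pool = np.clip(np.vstack([pop, *children]), lb, ub)
        pop = pool[np.argsort([fitness(p) for p in pool])[:pop_size]]
    best = min(pop, key=fitness).reshape(K, D)
    labels = ((X[:, None, :] - best[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    return best, labels
```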

4 Experimental Results and Analysis

The experiments were conducted using MATLAB R2022a 64-bit on a desktop computer with an AMD Ryzen 9 5950X 16-Core Processor (3.40 GHz), 32.00 GB RAM, SSD M.2 500 GB, and Microsoft Windows 11 Professional 64-bit operating system.

4.1 Performance Evaluation

The CMGO algorithm was evaluated against competing algorithms on the UCI and OpenML datasets, using the F-Measure metric and the tied rank test for statistical significance ranking. The F-Measure, which integrates precision and recall, is calculated from a confusion matrix as follows:

$$F-Measure(x)= \frac{2\times precision\times recall}{precision+recall} \times 100,$$
(17)

where, precision and recall are calculated using the following equations based on a confusion matrix:

$$precision= \frac{TP}{TP+FP},$$
(15)
$$recall= \frac{TP}{TP+FN},$$
(16)

where TP represents true positives, FP false positives, and FN false negatives. Higher precision indicates better algorithm performance, while higher recall captures more true positives, indicating better identification of positive instances. The F-Measure is evaluated upon termination of the optimization algorithm; a higher F-Measure corresponds to higher clustering accuracy. The authors of [8] stressed the importance of a low objective fitness value for accurate cluster formation, which guided our adoption of the F-Measure metric to evaluate and achieve precise clusters.
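As a small worked example, the following Python snippet computes Eqs. (15)-(17) from confusion-matrix counts; the counts themselves are invented purely for illustration.

```python
def f_measure(tp, fp, fn):
    """F-Measure of Eq. (17), scaled to [0, 100]."""
    precision = tp / (tp + fp)      # Eq. (15)
    recall = tp / (tp + fn)         # Eq. (16)
    return 2 * precision * recall / (precision + recall) * 100

# illustrative counts: 90 true positives, 10 false positives, 20 false negatives
print(f_measure(90, 10, 20))        # precision 0.90, recall ~0.82, F-Measure ~85.71
```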

The Benchmark Dataset for Clustering

We partitioned the benchmark datasets into three distinct groups: numerical datasets, categorical datasets, and mixed-data-type datasets. We used a total of 28 well-known datasets taken from the UCI [21] and OpenML [22] repositories. The numerical datasets were: Iris, Glass, Breast-Cancer-Wisconsin, Wine, Thyroid, Synthetic-Control-Charts, Ionosphere, Sonar, Diabetes, Ecoli, and Banknote-Authentication. The categorical datasets were: Balance Scale, Hayes-Roth, Monks, SPECT-Heart, and Nursery. The mixed-data-type datasets were: Acute-Inflammations, Analcatdata-Seropositive, Churn, Cloud, Fruitfly, Haberman, Newton-Hema, Sleuth-Case2002, Socmob, Tae, Heart-Disease, and ACA. By incorporating diverse datasets representing different data types, we aimed to comprehensively evaluate the performance of our algorithm across various scenarios. The characteristics of the three dataset groups are presented in Table 1.

Table 1. The characteristics of the three dataset groups.

4.2 Experimental Results

To verify the proposed CMGO, we compared it against 14 algorithms on 28 UCI and OpenML datasets. The algorithms used for comparison were: the Opposition African Vultures Optimization Algorithm (OAVOA) [17], Salp Swarm Algorithm (SSA) [9], Artificial Gorilla Troops Optimizer (GTO) [13], Jaya Algorithm (JAYA) [7], Dandelion Optimizer (DO) [10], Gray Wolf Optimization (GWO) [6], Modified Particle Swarm Optimization (MPSO) [23], Leader Slime Mould Algorithm (LSMA) [11], Flow Direction Algorithm (FDA) [12], Mountain Gazelle Optimizer (MGO) [14], Prairie Dog Optimization Algorithm (PDO1) [15], Time-varying Acceleration Coefficients Particle Swarm Optimization algorithm (TACPSO) [23], Chimp Optimization Algorithm (CHIMP) [16], and Chaotic League Championship Algorithm (KSCLCA) [8].

Table 2 presents the analysis for the numeric, categorical, and mixed data types. The average tied rank of the 15 algorithms is determined from the F-Measure. The average tied rank (Avg. tied rank) displayed in Table 2 is the mean of the tied rank scores of the F-Measure of each algorithm over the datasets of the corresponding data type. For numeric data, CMGO achieves the top rank with an average score (avg.sc) of 4.27, outperforming the original MGO, which holds the sixth position with an avg.sc of 6.55. For categorical data, CMGO again ranks first with an avg.sc of 4.40, demonstrating superior performance compared to the original MGO, ranked eighth with an avg.sc of 7.70. For mixed data, CMGO secures the third rank with an avg.sc of 6.46, surpassing the original MGO, ranked sixth with an avg.sc of 6.88.

Table 3 presents a thorough evaluation of data clustering performance, comparing the original MGO and the proposed CMGO algorithms across the three groups of datasets, using the F-Measure and tied rank as performance measures. The findings consistently indicate that CMGO outperforms the original MGO algorithm across the datasets, with a ratio of 18:9 in favor of CMGO.

Table 2. Comparison of the tied ranks of the 15 algorithms across the three data types.
Table 3. Comparison of the tied ranks between the original MGO and the proposed CMGO algorithms.
Table 4. Comparative analysis of the tied ranks of the 15 algorithms, using the F-Measure computed over all 28 datasets.

Table 4 presents the average ranking, tied rank, and average tied rank over all three data types based on the F-Measure. The rankings cover the set of 15 algorithms and all datasets. The algorithms rank as follows: CMGO, FDA, KSCLCA, MGO, TACPSO, JAYA, GTO, DO, OAVOA, LSMA, SSA, GWO, PDO1, CHIMP, and MPSO. CMGO achieved the first rank, whereas MGO obtained the fourth. These results highlight the capability of the proposed CMGO approach to enhance the performance of the original MGO algorithm across all datasets.

5 Discussion

In this section, we compare the exploitation and exploration abilities of the original MGO algorithm and the proposed CMGO algorithm. The evaluation tracked the TSM, MH, BMH, and MSF strategies, revealing differences between the two algorithms (Fig. 2). The Socmob dataset, representing mixed data types, was used to compare their behavior. In the original MGO algorithm, there is an initial emphasis on exploitation (the TSM strategy, red line), which gradually increases until the final iteration, while its exploration capability (the MH strategy, green line) is not heavily emphasized in the initial stages and gradually diminishes until the final iteration. In contrast, the CMGO algorithm intentionally reduces exploitation (TSM strategy, red line) to prevent premature convergence, with a slight reduction that continues until the final iteration. To achieve a more effective balance between exploration and exploitation, the exploration capability (MH strategy, green line) starts with little emphasis in the initial stages, increases rapidly during the intermediate stages, and then remains almost constant until the final iteration. Similar behavioral curves can be observed for the BMH and MSF strategies. Note that the MSF strategy was removed from the CMGO algorithm due to the negligible changes observed throughout the evaluation process.

Fig. 2. The behaviors of the four strategies of MGO and CMGO for the mixed data type.

It is worth noting, as shown in Table 2, that the mixed-data-type datasets all have only two classes (cluster centers) and generally a relatively small number of dimensions, whereas the numerical and categorical datasets differ in this respect. It can be assumed that finding a solution for mixed data types with fewer classes is not challenging. Some algorithms with a high degree of exploitation ability, such as the KSCLCA algorithm, perform exceptionally well on this problem, ranking first. In contrast, the proposed CMGO algorithm dropped to third place for mixed data types. It can be inferred that CMGO aims to improve the balance between exploration and exploitation by reducing exploitation and increasing exploration, albeit to a lesser degree than the KSCLCA algorithm. Nevertheless, CMGO still ranks among the top three algorithms, securing the third position with only a slight gap to the second-ranked DO algorithm. In summary, our proposed CMGO algorithm demonstrates its effectiveness particularly on problems with more than two classes. Further exhaustive investigations will be pursued in future work.

6 Conclusion

We anticipate that the findings and techniques presented in this study will prove valuable to individuals and researchers who have a keen interest in advancing the field of data clustering. The CMGO algorithm, along with the integration of the Gower distance technique, offers novel insights and solutions for addressing challenges in clustering diverse data types. By introducing the CMGO algorithm, we have expanded the capabilities of the traditional MGO for K-means clustering. The incorporation of a chaotic map into the Territorial Solitary Males strategy and the exclusion of the Migration to Search for Food strategy have enhanced CMGO's exploration and exploitation abilities. This adjustment allows CMGO to effectively handle complex datasets by striking a balance between thorough exploration and efficient exploitation of the solution space. Furthermore, our utilization of the Gower distance technique has overcome the limitations of K-means clustering when dealing with categorical and binary data. This technique has enabled CMGO to accurately compute distances between objects and cluster centers, ensuring reliable clustering results across a wide range of data types. We believe that the comprehensive evaluation of CMGO against 14 other state-of-the-art algorithms using 28 diverse datasets adds significant value to the field. The use of the F-Measure metric and the tied rank test for statistical significance ranking provides robust and reliable measures of CMGO's performance. The results clearly demonstrate CMGO's superiority over the original MGO and other tested algorithms, particularly in clustering pure numeric and categorical data.

In summary, we are confident that the insights and innovations presented in this study will inspire further developments in the field of data clustering. The CMGO algorithm, along with the integration of the Gower distance technique, offers a promising avenue for researchers and practitioners to tackle the challenges posed by diverse datasets. We hope that our contributions will serve as a foundation for future advancements in the field and encourage further exploration and experimentation in this area of study.