1 Introduction

The process of exploring and analysing large volumes of data for new, valid, and useful patterns is termed knowledge discovery. However, owing to the rapid growth in data generation and storage, it is becoming increasingly difficult to retrieve information with traditional analysis methods. Data mining can be employed to extract valuable information and patterns from such large data. Data mining techniques are used to scour databases so that new and useful patterns can be discovered efficiently. Data mining tasks are classified as predictive tasks and descriptive tasks (Tan et al. 2016). Predictive tasks determine the value of a particular attribute based on other attributes, whereas descriptive tasks derive patterns (correlations, trends, clusters) that summarize underlying relationships. Clustering is a descriptive task that groups objects based on some similarity measure. Broadly, clustering can be characterized as partitional or hierarchical. Partitional clustering groups objects into non-overlapping clusters based on inter-cluster distances. Hierarchical clustering builds a tree of clusters, either by an agglomerative (bottom-up) approach or by a divisive (top-down) approach. Several other clustering methods are reported in the literature: (i) graph clustering, (ii) spectral clustering, (iii) model-based clustering, and (iv) density-based clustering. Graph clustering operates on a collection of vertices and edges (Schaeffer 2007): vertices are grouped so that edges are dense within a cluster and relatively sparse between clusters. Spectral clustering is a subset of graph clustering methods that applies spectral analysis to cluster data points based on their graph representation (Kannan et al. 2004). It leverages graph theory and spectral analysis (eigenvalue decomposition) to cluster data points based on their similarity or affinity.
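As an illustration of the spectral idea described above, a minimal two-cluster sketch can be written in a few lines of NumPy: build a Gaussian affinity matrix, form the unnormalised graph Laplacian, and split on the sign of the Fiedler vector (the eigenvector of the second-smallest eigenvalue). This is a simplified sketch under those assumptions, not the method of any particular paper surveyed here.

```python
import numpy as np

def spectral_bipartition(X, sigma=1.0):
    """Two-way spectral split of the rows of X (a hypothetical helper)."""
    # Affinity (similarity) matrix from a Gaussian kernel on pairwise distances.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-d2 / (2 * sigma ** 2))
    # Unnormalised graph Laplacian L = D - W.
    L = np.diag(W.sum(axis=1)) - W
    # eigh returns eigenvalues in ascending order; the eigenvector of the
    # second-smallest eigenvalue (the Fiedler vector) encodes the weakest cut.
    _, vecs = np.linalg.eigh(L)
    return (vecs[:, 1] > 0).astype(int)
```

For k > 2 clusters, the usual extension embeds each point in the first k eigenvectors and runs a partitional algorithm such as k-means on that embedding.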
Spectral clustering is an efficient technique for handling various hard problems. Model-based clustering uses the concept of finite mixture models (Schaeffer 2007). It is a statistical approach that assumes the data are generated from a mixture of underlying probability distributions; the data can thus be viewed as a combination of probability distributions, each corresponding to a cluster. The goal of model-based clustering is to find the best-fitting model of the data by estimating the parameters of the underlying distributions. Density-based clustering techniques are designed to find clusters of arbitrary shapes. DBSCAN is a popular density-based example (Hahsler and Bolaños 2016). DBSCAN estimates the density around each data point by counting the points in its eps-neighbourhood and, using user-specified thresholds, identifies core, border, and noise points.
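The eps-neighbourhood logic of DBSCAN can be sketched as follows. This is a compact illustration of the core/border/noise labelling, with `eps` and `min_pts` as the user-specified thresholds (parameter names chosen here for illustration):

```python
import numpy as np

def dbscan(X, eps=0.5, min_pts=3):
    """Label each point with a cluster id (0..k-1) or -1 for noise."""
    n = len(X)
    # Pairwise Euclidean distances; eps-neighbourhoods estimate local density.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    core = np.array([len(nb) >= min_pts for nb in neighbours])
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        # Grow a new cluster from this unvisited core point (BFS expansion).
        labels[i] = cluster
        frontier = [i]
        while frontier:
            j = frontier.pop()
            for q in neighbours[j]:
                if labels[q] == -1:
                    labels[q] = cluster
                    if core[q]:          # only core points keep expanding
                        frontier.append(q)
        cluster += 1
    return labels
```

Points dense enough to be core points seed and grow clusters; points reachable from a core point but not themselves dense become border points; everything else is labelled noise.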

However, the literature shows that partitional clustering is the most prominent of these methods for data analysis. Partitional clustering is widely used in data analysis, machine learning, and data mining. It divides a dataset into non-overlapping groups such that each data point belongs to exactly one cluster. The technique aims to minimize within-cluster variance and maximize inter-cluster variance, producing clusters that are as distinct and cohesive as possible. While partitional methods such as k-means and k-medoids are popular for their simplicity and efficiency, they have limitations, including sensitivity to initial conditions, potential convergence to local optima, and difficulty in determining the optimal number of clusters. To address these limitations and enhance clustering performance, meta-heuristic algorithms have been proposed as alternatives or enhancements to traditional methods. Meta-heuristics offer a flexible and adaptive approach to partitional clustering: they employ intelligent search strategies to explore the solution space and optimize clustering assignments. Meta-heuristics are optimization algorithms that help find solutions to complex problems, and they provide a powerful means of optimizing different aspects of the clustering process, improving cluster quality and efficiently handling complex clustering problems. Different meta-heuristic approaches have been developed for optimizing the clustering process, which consists of several steps. First, the clustering problem is defined by fixing the number of clusters and the objective function. Next, the population is initialized by randomly generating an initial set of solutions.
The objective function then evaluates the quality of each solution, and the fitness value of each solution indicates how well it satisfies the clustering objective. The meta-heuristic iterates over the candidate solutions, improving their fitness values and the quality of the clusters; the best solutions found are used to update the current population. When the convergence criteria are met, the best solutions are returned as cluster centroids. The quality of the resulting clusters can then be evaluated using performance measures such as compactness, separation, or clustering stability. Meta-heuristic algorithms also improve clustering quality by modifying the cluster centres iteratively with respect to fitness requirements such as minimum intra-cluster distance, and they can handle non-convex clusters by exploring intricate search spaces and determining non-linear cluster boundaries. However, meta-heuristic algorithms also have limitations, such as getting stuck in local optima, slow convergence, unbalanced search mechanisms, loss of population diversity, and initialization issues (Yao et al. 2018; Bahrololoum et al. 2015; Bijari et al. 2018; Chang et al. 2016).
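The steps above (initialise a population of candidate centroid sets, evaluate their fitness, perturb them, keep improvements, and return the best centroids at convergence) can be sketched with a deliberately simple mutation-based search. Real meta-heuristics use far more sophisticated operators, so this is only a schematic illustration of the generic loop; the population size, step size, and iteration count are arbitrary choices.

```python
import numpy as np

def sse(X, centroids):
    # Fitness: total squared distance of each point to its nearest centroid.
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return (d.min(axis=1) ** 2).sum()

def metaheuristic_cluster(X, k=2, pop=20, iters=200, step=0.5, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Steps 1-2: initialise a random population of candidate centroid sets.
    population = rng.uniform(lo, hi, size=(pop, k, X.shape[1]))
    fitness = np.array([sse(X, c) for c in population])
    for _ in range(iters):
        # Step 3: perturb each candidate (a simple mutation operator).
        trial = population + rng.normal(0, step, population.shape)
        trial_fit = np.array([sse(X, c) for c in trial])
        # Step 4: greedy replacement keeps whichever solution is better.
        better = trial_fit < fitness
        population[better], fitness[better] = trial[better], trial_fit[better]
    # Step 5: return the best centroid set found.
    return population[fitness.argmin()], fitness.min()
```

Swapping the mutation and replacement rules for, e.g., crossover and selection yields a genetic algorithm, while velocity-based updates yield PSO; the surrounding loop stays the same.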
The visualization in Fig. 1a–d illustrates the examination of meta-heuristics in data clustering using VOS Viewer (Abbasi and Choukolaei 2023). This analysis explored various key terms within research articles from 2015 to 2024 in Science Direct that leverage meta-heuristics in data clustering. VOS Viewer is a specialized software tool for constructing and visualizing bibliometric networks. Widely embraced in academic circles, it facilitates the analysis and visualization of relationships among scientific publications, authors, keywords, and other entities within a specific research domain (Emrouznejad et al. 2023). These visualizations help researchers discern patterns, clusters, and trends within the literature, providing valuable insights into the structure and dynamics of the field under investigation.

Fig. 1
figure 1

a–d Network analysis based on meta-heuristics in data clustering keywords

The primary aim of this survey is to identify different metaheuristic algorithms presented in the literature for Partitional clustering, along with their associated shortcomings, methods for mitigating these shortcomings, objective functions, and benchmark datasets for clustering. To achieve this objective, several research questions have been formulated to ensure the accuracy of the survey findings. These research questions are outlined below.

1.1 Research questions (RQ)

The primary survey objective is to find answers to the following Research Questions (RQ):

RQ 1

What are the various meta-heuristic techniques available for clustering problems?

RQ 2

How to handle automatic data clustering?

RQ 3

How to handle high dimensional data (problems) with clustering?

RQ 4

What are the main reasons for hybridizing the clustering algorithms?

RQ 5

What are different objective functions (distance function), different performance measures, and benchmark datasets adopted to evaluate the performance of Partitional clustering algorithms?

1.2 Purpose of this survey

The purpose of this survey paper is to provide a comprehensive review of the field of partitional clustering. This study identifies recent advances in meta-heuristic algorithms, exploring their structure as well as their strengths and weaknesses for handling partitional clustering problems. The survey synthesizes knowledge from both classical and contemporary approaches to partitional clustering, including optimization-based methods (meta-heuristic algorithms), improved algorithms, hybrid algorithms, and adaptive control parameters. It also highlights the various distance functions adopted as similarity measures for clustering tasks and the benchmark datasets that can be used to evaluate the efficacy of clustering algorithms. By examining the strengths, limitations, and potential areas for improvement of these methods, this paper offers insights into the evolution of partitional clustering and guides future research directions. The survey is intended as a valuable resource for researchers, both for selecting and designing effective meta-heuristic algorithms for complex clustering tasks and for understanding the current state of partitional clustering. To analyse this rich literature, several research questions have been designed. The paper is divided into six sections. Section 2 summarizes the methodology adopted for the survey. Section 3 discusses the different techniques adopted for cluster analysis. Section 4 presents the diverse clustering objective functions, performance metrics, and datasets considered for clustering problems. Section 5 discusses the open issues and challenges related to clustering. Section 6 concludes the article, revisiting the research questions devised in Section 2.

2 Methodology for the survey

This section covers the research questions, sources of information, and the inclusion and exclusion criteria applied to research articles for an effective and efficient survey. Figure 2 illustrates the process of collecting research articles for this survey.

Fig. 2
figure 2

Research articles collection process

2.1 Source of information

The following databases are explored for the domain of data clustering.

2.2 Inclusion and search criteria

The objective is to find various meta-heuristic algorithms for effective handling of clustering problems. Figure 3 describes the process of inclusion and exclusion of research articles. The meta-heuristic algorithms considered meet the following criteria:

  (i) Related to meta-heuristic algorithms.

  (ii) Includes data clustering, high-dimensional clustering, and dynamic and automatic clustering.

  (iii) Related to single-objective and multi-objective clustering.

  (iv) Published between 2015 and 2024.

  (v) Published in SCI and SCOPUS-listed journals.

Fig. 3
figure 3

Process of inclusion and exclusion of research articles for review

The initial search considered all relevant work with the keywords: (Data clustering) OR (Meta-heuristic algorithms) OR (Single-objective clustering) OR (Multi-objective clustering) OR (High-dimensional clustering) OR (Dynamic and automatic clustering) OR (Graph clustering). The query was matched against the full text of articles rather than only the title or abstract.

2.3 Exclusion criteria

Exclusion criteria are also adopted to remove non-relevant research papers. Only research articles from journals of high repute (SCI and free Scopus) are considered. The exclusion criteria cover research published in books, national and international conferences, magazines, newsletters, educational courses, symposium workshops, and journals of lesser repute.

2.4 Extraction of articles

Initially, 956 articles were collected from various research databases; a large number of articles were retrieved because of the keyword “clustering”. The next step was to exclude non-relevant articles according to the criteria, which left 455 research articles. Next, only articles published in journals of repute were retained, with articles from non-reputed journals, books, and magazines removed manually; this excluded 182 more research articles. During the study, a further 189 research articles did not fit the predefined search criteria. Finally, 130 research articles were analysed for the survey; Table 1 illustrates this composition. A team of four researchers was formed to select articles manually against the predefined search criteria: two researchers selected the articles, and the selections were cross-checked by the third and fourth researchers, with any conflict resolved by a collective team decision. This process was repeated in every phase of study selection. Table 1 and Fig. 4 illustrate the journals considered for the survey.

Table 1 Journal composition after selection

Table 1 provides a comprehensive view of the distribution of research articles across the various journals within the surveyed literature. It is a tabular representation with four columns: Sr. No., Journal Name, Publisher, and No. of Papers. Each row corresponds to a specific journal and includes the journal name, publisher, and the number of papers from it within the surveyed literature. This breakdown allows a clear understanding of the publication landscape and the relative contribution of each journal to the body of research on clustering algorithms. From major publishers such as Elsevier and Springer to specialized outlets such as IEEE Transactions, the table encompasses a wide array of publication venues, highlighting the diversity of sources researchers draw upon when exploring clustering algorithms and reflecting the interdisciplinary nature of the field. By presenting this information in a structured and easily digestible format, Table 1 offers valuable insights into the dissemination of knowledge within the clustering research community, helping researchers identify key journals and publishers within the domain.

Fig. 4
figure 4

Box and whiskers diagram representation of article composition during the survey

2.5 Data classification process

Finally, the articles are classified into five categories and explored thoroughly to identify key points for the comparative study. Articles are re-analysed and evaluated on the following parameters: (i) algorithm/methodology used, (ii) type of clustering, (iii) datasets used, (iv) performance metrics, and (v) authors.

3 Literature survey

The literature survey is divided into five subsections.

This section analyses various meta-heuristic algorithms reported for clustering problems. Further, clustering problems are divided into Partitional clustering, dynamic and automatic clustering, and fuzzy clustering.

3.1 Meta-heuristic algorithms for partitional clustering

Meta-heuristic algorithms are higher-level procedures and heuristics for optimization problems, often inspired by natural phenomena such as biological evolution and swarm behaviour. These algorithms aim to find optimal or near-optimal solutions to partitional clustering problems, under several assumptions made when solving optimization tasks. They have been applied to clustering to improve the quality of the clustering process and to overcome challenges such as determining the optimal number of clusters, handling complex data distributions, and dealing with outliers. In this section, we explore meta-heuristic clustering algorithms that have been developed to enhance clustering performance, focusing on novel strategies and recent advancements. Meta-heuristic clustering algorithms, such as Genetic Algorithms (GAs), Particle Swarm Optimization (PSO), and Ant Colony Optimization (ACO), use population-based search strategies to optimize clustering objectives. In partitional clustering, these algorithms aim to find a set of cluster assignments that maximizes intra-cluster similarity while minimizing inter-cluster similarity. The data are partitioned into a fixed number of clusters using some distance measure, and the number of clusters is fixed and known in advance. In most cases, Euclidean distance is applied to determine the optimal set of clusters. Partitional clustering is also known as non-overlapping clustering because each data point belongs to only one cluster; the popular example is k-means, which is also known as hard clustering. Table 2 illustrates the partitional clustering literature covered during the survey, in terms of meta-heuristic algorithms that can be applied to improve the efficacy of clustering.
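As a baseline for the meta-heuristic variants surveyed in this section, the classic hard k-means procedure with Euclidean distance can be sketched as:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen data points.
    centroids = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid
        # (hard, non-overlapping assignment under Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: move each centroid to the mean of its cluster.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids
```

The sensitivity to the random initialisation visible in the first line of the loop is exactly what the meta-heuristic approaches in Table 2 aim to mitigate.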

Table 2 Illustrates partitional clustering

3.1.1 Meta-heuristic algorithms for dynamic and automatic partitional clustering

Dynamic and automatic clustering is a sub-branch of partitional clustering that focuses on grouping data points into meaningful clusters when the data change over time or new data are constantly being added. This is challenging because static clustering techniques, which rely on fixed datasets, may not be suitable for evolving data. Dynamic clustering techniques adapt to changes in the dataset by adjusting cluster structures and numbers as new data arrive or the data distribution shifts. Automatic clustering involves algorithms that automatically determine the optimal number of clusters and the other parameters required to generate the clusters. Combined, dynamic and automatic clustering provide an effective approach for evolving datasets without extensive manual intervention. Meta-heuristic algorithms can be used effectively in dynamic clustering because they provide flexible and efficient methods for exploring the search space; they are particularly useful for complex optimization problems and can adapt to changing environments. They can be applied to dynamic and automatic clustering by defining an appropriate objective function, such as minimizing intra-cluster distance or maximizing inter-cluster distance. As the data change over time, these algorithms adapt the clusters accordingly, ensuring that the clustering remains relevant and meaningful. This setting includes very large data, data streams, incomplete data, noisy data, unbalanced data, and structured data. In dynamic and automatic clustering, it is important to evaluate model performance regularly, ensuring that the clusters remain meaningful as the data evolve. The choice of algorithm depends on the characteristics of the dataset, including its size, dimensionality, and the rate at which it changes over time. This subsection highlights recent work reported on dynamic and automatic partitional clustering. Table 3 illustrates the dynamic and automatic clustering algorithms considered during the survey.
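One common way to make the number of clusters automatic, as described above, is to run a base clustering algorithm for several candidate values of k and keep the k with the best internal validity score. The sketch below uses a plain k-means as the base algorithm and the silhouette width as the validity index; both are illustrative choices, and the surveyed algorithms typically embed this selection inside the meta-heuristic search itself.

```python
import numpy as np

def kmeans_labels(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None] - C[None], axis=2).argmin(axis=1)
        C = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                      else C[j] for j in range(k)])
    return labels

def silhouette(X, labels):
    # Mean silhouette width: a = mean intra-cluster distance,
    # b = mean distance to the nearest other cluster.
    if len(set(labels)) < 2:
        return -1.0                       # degenerate single-cluster result
    d = np.linalg.norm(X[:, None] - X[None], axis=2)
    idx, scores = np.arange(len(X)), []
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:               # singleton cluster: s(i) = 0
            scores.append(0.0)
            continue
        a = d[i, same & (idx != i)].mean()
        b = min(d[i, labels == c].mean() for c in set(labels) - {labels[i]})
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

def auto_k(X, k_max=6, restarts=5):
    # "Automatic" clustering: try several k, keep the best validity score.
    best_k, best_s = 2, -1.0
    for k in range(2, k_max + 1):
        s = max(silhouette(X, kmeans_labels(X, k, seed=r))
                for r in range(restarts))
        if s > best_s:
            best_k, best_s = k, s
    return best_k
```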

Table 3 Illustrates dynamic and automatic clustering

3.1.2 Meta-heuristic algorithms for fuzzy clustering (generalization of the partitional clustering)

Fuzzy clustering, also known as soft clustering, is a generalization of the partitional clustering method in which each data point can belong to more than one cluster with a certain degree of membership, in contrast to traditional (hard) clustering methods such as k-means, where each data point is assigned to one and only one cluster. Fuzzy clustering is particularly useful when the boundaries between clusters are not clear-cut, or when the data are inherently ambiguous or overlapping. The most commonly used fuzzy clustering algorithm is Fuzzy C-Means (FCM), introduced by James C. Bezdek in 1981; FCM extends the classic k-means algorithm to allow data points partial membership in multiple clusters. Fuzzy clustering is widely used in applications such as pattern recognition, data analysis, image segmentation, and bioinformatics, where overlapping or ambiguous groups may exist in the data. Meta-heuristic algorithms can be employed in fuzzy clustering to optimize the clustering process, particularly for finding the optimal number of clusters, the best initial cluster centroids, or the optimal fuzziness parameter (m). FCM can suffer from limitations such as sensitivity to initial conditions and local optima; meta-heuristic algorithms can help by exploring a broader search space and finding better solutions. By integrating meta-heuristic algorithms with fuzzy clustering, more robust, flexible, and efficient clustering results can be obtained in complex data environments. Table 4 highlights the recent work reported on fuzzy clustering.
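The FCM iteration alternates two updates: fuzzy-weighted centroid means, and memberships derived from inverse relative distances. A minimal NumPy sketch with fuzziness m = 2 (random membership initialisation and a fixed iteration budget are simplifying assumptions):

```python
import numpy as np

def fcm(X, k=2, m=2.0, iters=100, seed=0, eps=1e-9):
    rng = np.random.default_rng(seed)
    # Random initial membership matrix U; each row sums to 1.
    U = rng.random((len(X), k))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(iters):
        Um = U ** m
        # Centroid update: fuzzy-weighted mean of all points.
        C = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Membership update: inverse relative distances, exponent 2/(m-1).
        d = np.linalg.norm(X[:, None] - C[None], axis=2) + eps
        inv = d ** (-2.0 / (m - 1))
        U = inv / inv.sum(axis=1, keepdims=True)
    return U, C
```

A meta-heuristic wrapper would typically optimize the initial U (or C) and possibly m, using an objective such as the FCM weighted within-cluster sum of squares.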

Table 4 Illustrates fuzzy clustering

3.1.3 Improved meta heuristic algorithm for partitional clustering

Meta-heuristic algorithms explore the search space to find solutions to optimization problems, but it is not always possible for them to explore the entire search space, and they are not exact methods. To enhance their performance, several amendments can be made to improve their efficiency and effectiveness, such as using neighbourhood concepts, defining new search strategies, and making the algorithmic parameters adaptive. Improved meta-heuristic algorithms enhance efficiency, convergence speed, the exploration-exploitation balance, and robustness in solving partitional clustering problems. Improvements include combining different meta-heuristics according to their strengths to offset individual weaknesses; integrating local search methods with meta-heuristics to refine solutions in promising areas of the search space; dynamically adjusting algorithm parameters based on feedback from the search process so the algorithms adapt more effectively to clustering problems; and designing procedures that let algorithms self-adapt their parameters automatically during the search. These improvements can be tailored and combined in various ways depending on the specific problem and application. Research on meta-heuristic algorithms continues to evolve, and new approaches and enhancements are regularly proposed in the academic and research communities. This section therefore summarizes the improvements reported to original meta-heuristic algorithms for effectively solving clustering problems. Table 5 illustrates the improved meta-heuristic algorithms in the literature.
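As one concrete example of making an algorithmic parameter adaptive, the sketch below runs a (1+1)-style search over centroid positions whose mutation step size grows on success and shrinks on failure, a simple self-adaptation rule in the spirit of evolution strategies. The constants 1.2 and 0.95 are illustrative choices, not values taken from the surveyed papers.

```python
import numpy as np

def sse(X, C):
    d = np.linalg.norm(X[:, None] - C[None], axis=2)
    return (d.min(axis=1) ** 2).sum()

def adaptive_es_cluster(X, k=2, iters=300, seed=0):
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)].astype(float)
    fit, step = sse(X, C), 1.0
    for _ in range(iters):
        trial = C + rng.normal(0, step, C.shape)
        tfit = sse(X, trial)
        if tfit < fit:
            C, fit = trial, tfit
            step *= 1.2          # success: widen the search (exploration)
        else:
            step *= 0.95         # failure: narrow the search (exploitation)
    return C, fit
```

The adaptive step size lets the same algorithm take large moves early, when candidates are far from good centroids, and fine-grained moves later, without hand-tuning a schedule.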

Table 5 Improved metaheuristic

3.1.4 Hybrid metaheuristic algorithm for partitional clustering

Hybridization is an active area of research for improving and enhancing the performance of algorithms. A hybrid meta-heuristic algorithm combines different meta-heuristic approaches, or integrates a meta-heuristic with other optimization techniques, to exploit their respective strengths while mitigating their weaknesses. In the context of partitional clustering, where the dataset is divided into disjoint clusters with each data point belonging to exactly one cluster, a hybrid meta-heuristic can optimize cluster assignments and centroids while balancing exploration and exploitation in the search process. Such hybrids can enhance clustering by improving the selection of initial cluster centres, balancing exploration and exploitation during the search, and increasing the algorithm's robustness and efficiency. Hybrid meta-heuristic algorithms can be fine-tuned and adapted to the specific clustering problem and dataset characteristics, which is particularly beneficial for complex clustering problems where traditional methods may struggle. By leveraging the strengths of multiple meta-heuristic approaches, hybrid algorithms can potentially outperform individual methods, offering more robust and effective solutions. This section presents the various hybrid meta-heuristic algorithms reported for solving clustering problems; Table 6 illustrates the hybrid meta-heuristic clustering algorithms in the literature.
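A minimal illustration of the hybrid idea combines a global component (randomized restarts over initial centres) with a local component (a few k-means refinement steps per candidate); this memetic-style pattern is a sketch of the general scheme, not any specific algorithm from Table 6.

```python
import numpy as np

def lloyd_refine(X, C, steps=3):
    # Local search: a few k-means update steps polish a candidate solution.
    for _ in range(steps):
        labels = np.linalg.norm(X[:, None] - C[None], axis=2).argmin(axis=1)
        C = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                      else C[j] for j in range(len(C))])
    return C

def sse(X, C):
    d = np.linalg.norm(X[:, None] - C[None], axis=2)
    return (d.min(axis=1) ** 2).sum()

def hybrid_cluster(X, k=2, restarts=10, seed=0):
    rng = np.random.default_rng(seed)
    best, best_fit = None, np.inf
    for _ in range(restarts):
        # Global component: randomized restart over centroid positions.
        C = X[rng.choice(len(X), k, replace=False)].astype(float)
        C += rng.normal(0, 0.5, C.shape)
        # Local component: Lloyd refinement of the candidate.
        C = lloyd_refine(X, C)
        f = sse(X, C)
        if f < best_fit:
            best, best_fit = C, f
    return best, best_fit
```

In a full hybrid, the "global component" would be a meta-heuristic population (GA, PSO, etc.) rather than independent restarts, with the local refinement applied to each individual.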

Table 6 Hybrid approaches

4 Objective function, performance metric and dataset

This section describes various objective functions, performance metrics, and datasets used to solve clustering problems.

4.1 Objective function

Clustering is an unsupervised technique for data exploration that aims to find groups of data, known as clusters. An objective function is required to find these groups. The objective function is typically distance-based, measuring the distance between data points and clusters, and its purpose is to quantify the quality of the clusters. This can be described in terms of cluster compactness, defined as the total distance of each cluster's data points to the cluster centroid. Many objective functions have been presented in the literature for effective clustering, and clustering cannot be performed without one; it is therefore necessary to pick an appropriate clustering objective. Table 7 lists the well-known clustering objectives reported for the clustering task and studied during this survey. Euclidean distance is the most widely adopted objective function for clustering problems.
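For concreteness, cluster compactness under a pluggable distance function can be written as a single sum; the distance functions below (Euclidean, Manhattan, cosine) are common choices among the objective functions surveyed, and the function names are our own for illustration.

```python
import numpy as np

def euclidean(a, b):
    return np.linalg.norm(a - b)

def manhattan(a, b):
    return np.abs(a - b).sum()

def cosine_dist(a, b):
    # 1 - cosine similarity; sensitive to direction, not magnitude.
    return 1 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def compactness(X, labels, centroids, dist=euclidean):
    # Total distance of every point to its own cluster centroid.
    return sum(dist(x, centroids[l]) for x, l in zip(X, labels))
```

Swapping `dist` changes the clustering objective without touching the rest of the algorithm, which is why Table 7 can list many objectives for one and the same search procedure.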

Table 7 List of objective functions

4.2 Performance metrics

Performance metrics are used to evaluate clustering algorithms. They should be independent and reliable measures that can assess and compare the experimental results of clustering algorithms; based on such comparisons, the validity of a clustering algorithm is established. In general, two kinds of evaluation are used: external evaluation and internal evaluation. External evaluation uses information from outside the clustering itself, such as known class labels of the dataset, whereas internal evaluation assesses the clustering using the dataset alone. Metrics such as accuracy, f-measure, normalized mutual information, and the Rand index are commonly used for external evaluation, while the Davies-Bouldin index, Silhouette index, Dunn index, and entropy are used for internal evaluation. This paper also surveys the performance metrics reported for assessing clustering algorithms: 42 performance metrics are reported in the literature, and Table 8 lists them. Figure 5 presents a 3D pie chart visualizing how these metrics are distributed across the clustering literature, showing the relative prominence of each assessment criterion and the diversity and breadth of the evaluation criteria used by researchers. Among these metrics, Normalized Mutual Information (NMI), the Rand index, accuracy, entropy, f-measure, and error rate emerge as particularly prominent and widely embraced within the research community; their prevalence underscores their significance in gauging the effectiveness and efficiency of clustering algorithms across applications and scenarios.
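As a small worked example of the external/internal distinction, the Rand index (external: it needs ground-truth labels) and the Dunn index (internal: it needs only the data and the partition) can be computed as:

```python
import numpy as np
from itertools import combinations

def rand_index(true, pred):
    # External metric: fraction of point pairs on which the two labelings
    # agree (both in the same cluster, or both in different clusters).
    agree = sum((t1 == t2) == (p1 == p2)
                for (t1, p1), (t2, p2) in combinations(zip(true, pred), 2))
    n = len(true)
    return agree / (n * (n - 1) / 2)

def dunn_index(X, labels):
    # Internal metric: smallest inter-cluster distance divided by the
    # largest intra-cluster diameter; larger values mean better separation.
    labels = np.asarray(labels)
    d = np.linalg.norm(X[:, None] - X[None], axis=2)
    ks = sorted(set(labels.tolist()))
    sep = min(d[np.ix_(labels == a, labels == b)].min()
              for a in ks for b in ks if a != b)
    diam = max(d[np.ix_(labels == a, labels == a)].max() for a in ks)
    return sep / diam
```

Note that the Rand index is invariant to a relabelling of the clusters (swapping cluster ids 0 and 1 leaves it unchanged), which is exactly the property an external clustering metric needs.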

Table 8 List of different performance measures reported in the literature
Fig. 5
figure 5

3-D pie chart for performance measures

4.3 Dataset

The dataset also plays an important role in validating the performance of clustering algorithms. Because clustering is an unsupervised method, no class information is given when a clustering algorithm is run; objects are assigned to clusters based on the objective function. However, the external evaluations used to assess clustering performance require class (cluster) information. Moreover, some datasets are linearly separable whereas others are non-linearly separable, and the performance of a clustering algorithm may be affected by these properties. The simulation results of a clustering algorithm also depend on attribute types, the dimensionality of the dataset, data size, and so on. This study highlights the various datasets used to evaluate the performance of clustering algorithms: forty datasets are reported in the literature, listed in Table 9. The iris, wine, glass, CMC, vowel, cancer, breast cancer, and thyroid datasets are the most widely used for evaluating clustering algorithms.

Table 9 List of datasets adopted to evaluate simulation results

Figure 6 shows a 3D pie chart of the datasets commonly used to assess clustering algorithms. Each segment represents a specific dataset, with its size corresponding to the relative frequency of its use in clustering algorithm evaluation. The chart underscores the prevalence of datasets such as iris, wine, glass, CMC, vowel, cancer, breast cancer, and thyroid, which have become standard benchmarks within the clustering community. This visualization serves as a reference for researchers and practitioners, depicting the dataset landscape and highlighting the benchmarks most often adopted for algorithm evaluation.

Fig. 6

3-D pie chart for datasets adopted to evaluate simulation results

5 Issues and challenges

This section summarizes the various issues that can be addressed through meta-heuristic algorithms. It is observed that a large number of meta-heuristic algorithms have been applied to solve clustering problems effectively.

5.1 Issues in partitional clustering

In partitional clustering, various meta-heuristic algorithms are applied to solve clustering problems effectively. The main reasons for adopting meta-heuristic algorithms for partitional clustering are listed below.

  (i) To determine near-optimal solutions for partitional clustering problems.
  (ii) To evaluate optimal centroids for effective clustering.
  (iii) To determine similar patterns in categorical data.
  (iv) To handle heterogeneous data.
  (v) To determine subspace clusters in the dataset.
  (vi) To handle multimodal and heterogeneous data for effective clustering.
  (vii) To perform clustering of high-dimensional data.
  (viii) To handle educational data mining.
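Points (i) and (ii) above can be sketched in a few lines: a meta-heuristic treats a candidate set of centroids as one solution and searches for the set minimizing a clustering objective such as the sum of squared errors (SSE). The deliberately naive sketch below uses blind random sampling in place of the guided search of PSO, GA, or ACO; the function names and toy data are our own illustrations, not a published method.

```python
import random

def sse(points, centroids):
    """Clustering objective: sum of squared Euclidean distances from
    each point to its nearest centroid."""
    return sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
               for p in points)

def random_search_clustering(points, k, iters=500, seed=0):
    """Meta-heuristic sketch: repeatedly sample a candidate set of k
    centroids inside the data's bounding box and keep the set with the
    lowest SSE.  Guided meta-heuristics replace the blind sampling."""
    rng = random.Random(seed)
    dims = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dims)]
    hi = [max(p[d] for p in points) for d in range(dims)]
    best, best_cost = None, float("inf")
    for _ in range(iters):
        cand = [tuple(rng.uniform(lo[d], hi[d]) for d in range(dims))
                for _ in range(k)]
        cost = sse(points, cand)
        if cost < best_cost:
            best, best_cost = cand, cost
    return best, best_cost

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, cost = random_search_clustering(pts, k=2)
print(cost)  # far below the one-centroid SSE of about 221
```

Real meta-heuristics keep the same solution encoding and objective but move candidates toward good regions instead of sampling uniformly, which is what yields near-optimal solutions at acceptable cost.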

5.2 Issues in dynamic and automatic clustering

From the extensive literature survey, it is inferred that some meta-heuristic algorithms are also adopted in the field of dynamic and automatic clustering. The main reasons for applying meta-heuristic algorithms in this area are listed below.

  (i) To enhance the convergence rate of algorithms.
  (ii) To avoid stagnation and premature convergence.
  (iii) To develop an optimization strategy for dynamic clustering.
  (iv) To handle dynamic streams automatically.
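The defining difficulty of automatic clustering is that the number of clusters is not given and must be inferred from the data. As a minimal 1-D illustration (the `estimate_k` helper and its threshold are our own illustrative assumptions, not a published algorithm), one crude strategy counts gaps that are much larger than the typical gap between sorted values:

```python
def estimate_k(values, gap_factor=3.0):
    """Automatic-clustering sketch: infer the number of clusters in
    1-D data by counting gaps much larger than the median gap."""
    xs = sorted(values)
    gaps = [b - a for a, b in zip(xs, xs[1:])]
    median_gap = sorted(gaps)[len(gaps) // 2]
    # every unusually large gap separates two adjacent clusters
    return 1 + sum(1 for g in gaps if g > gap_factor * median_gap)

data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 10.0, 10.1]
print(estimate_k(data))  # three well-separated groups -> 3
```

Meta-heuristic automatic-clustering algorithms generalize this idea: they encode the number of clusters (or cluster activation flags) inside the solution itself and let a validity index in the objective function decide how many clusters the data supports.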

5.3 Issues in fuzzy clustering

In the field of fuzzy clustering, some meta-heuristic algorithms are also reported. These algorithms aim to improve the quality of solutions, especially for fuzzy clustering problems. The issues handled by these algorithms are listed.

  (i) To generate optimum cluster centres using the fuzzy membership function.
  (ii) To handle high-dimensional datasets.
  (iii) To determine relevant features in the case of high-dimensional data.
  (iv) To develop accurate prediction models.
  (v) To improve the quality of solutions.
  (vi) To handle data streams in an effective manner.
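For point (i), the fuzzy membership function referred to is the standard fuzzy c-means update, which assigns each point a degree of membership in every cluster rather than a hard label. A minimal sketch (1-D distances and the data are illustrative; `m > 1` is the usual fuzzifier):

```python
def fuzzy_memberships(point, centres, m=2.0):
    """Degree of membership of one point in each cluster, following the
    standard fuzzy c-means update; the memberships sum to 1."""
    d = [abs(point - c) for c in centres]  # 1-D distances for brevity
    if any(di == 0.0 for di in d):         # point coincides with a centre
        return [1.0 if di == 0.0 else 0.0 for di in d]
    expo = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dk) ** expo for dk in d) for di in d]

# A point midway between two centres belongs equally to both ...
print(fuzzy_memberships(1.0, [0.0, 2.0]))  # [0.5, 0.5]
# ... a point nearer the first centre belongs mostly to it
print(fuzzy_memberships(0.5, [0.0, 2.0]))  # approx. [0.9, 0.1]
```

Meta-heuristics plug in here by searching for the cluster centres that optimize the fuzzy objective, instead of (or alongside) the alternating c-means updates.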

5.4 Issues in improved meta heuristic algorithm for clustering

This subsection demonstrates various issues related to the performance of meta-heuristic algorithms and the need to improve these algorithms for efficiently solving clustering problems. The various shortcomings associated with meta-heuristic algorithms are successfully addressed through improved versions of these algorithms. The main reasons for improving meta-heuristic algorithms are listed below.

  (i) To overcome the slow convergence rate of meta-heuristic algorithms.
  (ii) To avoid the premature convergence problem.
  (iii) To reduce the effect of noise and improve the quality of solutions.
  (iv) To handle clustering in a hierarchical manner.
  (v) To reduce computational cost.
  (vi) To achieve an effective trade-off between local search and global search.
  (vii) To tackle overlapping and incremental clustering.
  (viii) To handle constraints in an effective manner.
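The trade-off in point (vi) is often implemented with a temperature-controlled acceptance rule, as in simulated annealing: worsening moves are sometimes accepted so the search can escape local optima (exploration), and the acceptance probability shrinks as the temperature drops (exploitation). The sketch below shows only the acceptance step; the function name and parameters are our own.

```python
import math
import random

def sa_accept(delta, temperature, rng):
    """Simulated-annealing acceptance rule: improving moves (delta <= 0)
    are always taken; worsening moves are taken with probability
    exp(-delta / T).  High T favours exploration, low T exploitation."""
    if delta <= 0:
        return True
    return rng.random() < math.exp(-delta / temperature)

rng = random.Random(0)
print(sa_accept(-1.0, 0.01, rng))  # improving move: always accepted
print(sa_accept(5.0, 1e-6, rng))   # worsening move at tiny T: rejected
```

Improved meta-heuristics typically replace a fixed acceptance or step-size rule with an adaptive schedule of exactly this kind, which is where most of the convergence-rate gains come from.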

5.5 Issues in hybrid meta heuristic algorithm for clustering

The issues that can be addressed through hybrid meta-heuristic algorithms are listed.

  (i) To overcome the shortcomings of traditional clustering algorithms, such as local optima, and improve the quality of results.
  (ii) To remove infeasible solutions generated during execution.
  (iii) To handle the local optima and convergence issues of meta-heuristic algorithms.
  (iv) To improve the search mechanisms of algorithms.
  (v) To effectively handle exploration and exploitation processes.
  (vi) To address the initialization issues of clustering algorithms.
  (vii) To explore more promising solutions for clustering problems.
  (viii) To explore the solution search space in an effective and efficient manner.
  (ix) To generate neighbourhood solutions.
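A common hybrid pattern behind several of the points above is "global meta-heuristic proposes, local k-means refines": candidate centroids from any global search are polished with a few Lloyd iterations, combining global exploration with fast local exploitation. A minimal sketch of the local-refinement half (data and names are illustrative):

```python
def lloyd_refine(points, centroids, steps=10):
    """Local-refinement half of a hybrid scheme: standard k-means
    (Lloyd) iterations applied to centroids proposed by a global
    meta-heuristic search."""
    for _ in range(steps):
        # assignment step: nearest centroid by squared Euclidean distance
        groups = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[i])))
            groups[j].append(p)
        # update step: move each centroid to the mean of its group
        centroids = [tuple(sum(p[d] for p in g) / len(g)
                           for d in range(len(g[0]))) if g else c
                     for g, c in zip(groups, centroids)]
    return centroids

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
# rough centroids, e.g. from a meta-heuristic, snap to the group means
print(lloyd_refine(pts, [(1.0, 1.0), (9.0, 9.0)]))
# -> [(0.0, 0.5), (10.0, 10.5)]
```

The hybridization addresses points (i) and (vi) directly: the global search supplies good initializations, and the local refinement removes k-means's sensitivity to them.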

6 Conclusion

In this survey, a large number of meta-heuristic algorithms are analysed with respect to clustering applications. It is inferred that clustering problems can be classified into partitional, dynamic, and fuzzy clustering. A diversity of algorithms is reported in the literature to solve clustering problems effectively and efficiently. Some algorithms address issues related to performance, population diversity, local optima, search strategies, neighbourhood solutions, the number of clusters, optimized cluster centres, etc. This paper surveys high-repute publications over the period 2015–2024. The articles are categorized into partitional, dynamic & automatic, and fuzzy clustering, and further classified into meta-heuristic, improved meta-heuristic, and hybrid meta-heuristic algorithms. Before the literature survey, several research questions were designed to guide an effective and efficient review. The major contributions of this survey to the scientific community are summarized through the answers to these research questions.

RQ 1

What are the various meta-heuristic techniques available for clustering problems?

Answer: A large number of meta-heuristic algorithms employed to solve clustering problems are analysed. Several new algorithms have been developed for these problems (CSS, MCSS, bird flock algorithm, electromagnetic-force-based algorithm, magnetic optimization algorithm, gravity algorithm, Big Bang-Big Crunch algorithm). It is observed that these algorithms provide significant results in comparison with PSO, SA, TS, ACO, GA, K-means, etc. It is also observed that relatively few algorithms are based on traditional mathematical models. Recently developed algorithms are inspired by natural phenomena such as the Big Bang-Big Crunch, well-established laws such as the law of gravity, and swarm behaviour (cuckoo optimization is inspired by the cuckoo's behaviour). Tables 2, 3, 4, 5 and 6 summarize these algorithms.

RQ 2

How to handle automatic data clustering?

Answer: Dynamic & automatic clustering problems are an active area of research due to online, web, and social mining. In these problems, the number of clusters is undefined, and clusters are formed according to the nature of the data. It is observed that several single-objective clustering algorithms have been proposed to address the dynamic clustering problem; again, these algorithms are based on natural phenomena (swarm behaviour). Only a few multi-objective algorithms have been developed to handle dynamic clustering problems. Hence, it can be expected that this direction will attract considerable attention in the near future.

RQ 3

How to handle high dimensional data (problems) with clustering?

Answer: At present, a large volume of data is generated, and this volume is increasing exponentially. This data contains meaningful patterns, but it is not an easy task to explore and analyse them. To handle large-data problems and extract meaning, several meta-heuristic clustering algorithms have been proposed. A few are integrated with Hadoop (a parallel architecture) to retrieve and process data much faster than traditional approaches, and some ensemble clustering methods can handle high-dimensional data. However, there is a lack of multi-objective clustering methods to handle the aforementioned issues.

RQ 4

What are the main reasons for hybridizing the clustering algorithms?

Answer: Many improved and hybridized versions of algorithms have been proposed. An algorithm is improved or hybridized either due to shortcomings associated with the algorithm itself or to avoid shortcomings related to the problems being solved. Through the literature survey, several shortcomings associated with both algorithms and clustering problems are observed: local optima, convergence rate, population diversity, boundary constraints, neighbourhood solution structure, the trade-off between local and global search, the solution search mechanism, solution search equations, and dependence on random functions. It is also observed that hybridization is an active area of research and that hybridizing an algorithm can improve its performance. Hence, to overcome the aforementioned problems, an algorithm can either be improved or hybridized to obtain significant and optimized results. To date, there is no generic algorithm for solving all types of clustering problems and data (categorical, nominal, numeric, text, and binary).

RQ 5

What objective functions, performance measures, and datasets are adopted to evaluate the performance of clustering algorithms?

Answer: A large number of performance measures are employed to evaluate the performance of clustering algorithms; Table 8 lists those reported in the literature. It is observed that NMI, the Rand index, accuracy, intra- and inter-cluster distance, and F-measure are the most widely adopted. Table 7 summarizes the objective functions used to measure closeness between data objects: ten objective functions are reported in the literature, with Euclidean distance the most widely adopted. The various datasets reported in the literature for performance evaluation are summarized in Table 9. It is observed that Iris, Wine, Glass, Haberman, CMC, Vowel, and Breast cancer are the most significant (benchmark) datasets for evaluation. The highlights of the survey are listed below.

  • 130 SCI- and/or Scopus-indexed (free) articles from 70 journals, published between 2015 and 2024, are included.

  • Euclidean distance is the most widely adopted measure of closeness between data objects.

  • Partitional clustering is the most widely studied clustering problem.

  • Improved and hybrid meta-heuristic algorithms are widely reported for effective and efficient clustering of data.

  • Hybrid meta-heuristic algorithms are the most significant approach to handling various clustering problems.

  • Fuzzy and Automatic data clustering is a new and active area of research.

  • Little work is reported on multi-objective data clustering, which leaves scope for research in this direction.

In this survey, we have undertaken a comprehensive analysis of various meta-heuristic algorithms in the context of clustering applications. Our investigation has shed light on the diverse landscape of clustering problems, which can be classified into Partitional, dynamic, and fuzzy clustering categories. Through an extensive review of the literature published between 2015 and 2024, we have identified a multitude of algorithms that address key challenges associated with clustering, including performance, population diversity, local optima, and search strategies. Our survey has revealed the emergence of several novel meta-heuristic techniques for solving clustering problems, such as CSS, MCSS, Bird flock algorithm, Electromagnetic force-based algorithm, Magnetic optimization algorithm, Gravity algorithm, and Big Bang big crunch algorithm. These algorithms have demonstrated promising results compared to traditional methods like PSO, SA, TS, ACO, GA, and K-means, showcasing the effectiveness of leveraging natural phenomena and established laws as inspiration for algorithm design.

Additionally, we have explored the ongoing research efforts in dynamic and automatic clustering, which are driven by the growing demand for real-time data analysis in domains like online, web, and social mining. While single-objective clustering algorithms have made significant strides in addressing dynamic clustering challenges, there remains a need for the development of multi-objective algorithms to handle the complexity of evolving datasets more effectively. Furthermore, our survey has highlighted the importance of addressing the challenges posed by high-dimensional data in clustering. With the exponential growth of data volumes, there is a pressing need for meta-heuristic clustering algorithms capable of handling large-scale datasets efficiently. Integration with parallel architectures like Hadoop and the exploration of ensemble clustering methods represent promising avenues for addressing these challenges in the future. While our survey has provided valuable insights into the state-of-the-art in clustering, it is essential to acknowledge certain limitations inherent in our study. From a theoretical standpoint, the complexity of clustering problems and the diversity of datasets make it challenging to devise a one-size-fits-all solution. Moreover, practical limitations, such as computational resources and algorithm scalability, may impact the applicability of certain clustering techniques in real-world scenarios.

Moving forward, future research in clustering should focus on addressing these limitations and exploring new avenues for improvement. One promising direction is the development of hybrid meta-heuristic algorithms that combine the strengths of different optimization techniques to overcome the shortcomings of individual approaches. Additionally, there is a need for more extensive benchmarking of clustering algorithms using diverse datasets and performance metrics to ensure robustness and generalizability of results. In conclusion, our survey has provided valuable insights into the state-of-the-art meta-heuristic clustering algorithms and identified key areas for future research. By addressing the challenges posed by clustering in the era of big data, we can unlock new opportunities for knowledge discovery and decision-making in various domains.