1 Introduction

Rapid development in different areas of science and technology requires the storage of huge amounts of data in databases. These datasets are often very complex and involve many parameters. Such a large volume of data does not provide any useful information without processing. Since manual processing of these datasets is beyond human capacity, people have turned toward computing technologies to automate models that serve the desired purpose [1]. Data mining is one of the most important techniques for extracting useful information from large-scale datasets. It basically involves three steps to reach the desired objective(s). The first step prepares the data and is called data scrubbing. In the second step, a suitable data mining algorithm is selected. In the last step, the data are analyzed. Among these, the selection of a suitable data mining algorithm has the strongest influence on the overall performance of the model. In addition, data mining offers various methods, such as summarization, association, and data clustering, to discover hidden patterns in large databases for practical applications. Among these, data clustering is one of the most popular methods and has gained growing research interest in the last few decades.

The objective of clustering is to partition N data objects into K clusters. The data objects in a cluster are highly similar to one another; however, they are highly dissimilar to the data objects of other clusters [2]. Clustering is widely used in different domains of science and engineering such as community detection [3], web page recommendation [4], text mining [5], image segmentation [6], stock market prediction [7], and many more. Clustering is an optimization problem, and various optimization algorithms have been suggested recently to achieve the desired goal. Broadly speaking, these algorithms are called nature-inspired algorithms because they exploit natural phenomena during the optimization process. Nature-inspired algorithms are broadly classified into two subclasses: evolutionary algorithms [8, 9] and swarm intelligence-based algorithms [10, 11].

Various heuristic approaches based on evolutionary algorithms have been proposed for data clustering in the last few decades [12]. Maulik et al. [13] proposed a clustering approach based on differential evolution (DE) for image classification and justified the efficacy of their proposal. Das et al. [14] suggested a DE-based approach for clustering the pixels of an image in the gray-scale intensity space. A survey on clustering based on nature-inspired metaheuristic algorithms is given in [15]. Recently, various swarm intelligence-based algorithms have been developed and later applied to data clustering problems. An artificial bee colony (ABC) algorithm-guided data clustering approach has been suggested in [16, 17]. An extensive survey of data clustering approaches based on particle swarm optimization (PSO) is presented in [18, 19]. A gray wolf optimizer-based clustering algorithm is proposed in [20]. In the last few decades, chaotic sequences created by chaotic maps have been used in optimization algorithms to improve their performance [21]. In this context, Chuang et al. [22] proposed an approach based on a chaotic map and PSO for solving the data clustering problem. Li et al. [23] proposed a clustering algorithm based on a chaotic PSO and a gradient method. Wan et al. [24] suggested a chaotic ant swarm approach for data clustering.

Jamshidi et al. [25] proposed an adaptive neuro-fuzzy inference system to identify the dynamic behavior of a lithium-ion battery. The authors justified the effectiveness of their approach in minimizing errors while handling systems with multiple nonlinear behaviors and uncertainties. An adaptive neuro-fuzzy inference system based on a subtractive clustering algorithm is proposed in [26] for identifying the remaining useful life of electrolytic capacitors. A new multiobjective approach for detecting money laundering is presented in [27]. Jamshidi et al. [28] proposed a neuro-fuzzy-guided technique to model a Li-ion battery used in a small NASA satellite and claimed the competitiveness of their approach in achieving the desired goal. In addition, a variety of optimization problems are being solved using appropriate optimization algorithms [29,30,31].

Recently, a swarm intelligence-based algorithm, called Harris hawks optimization (HHO) [32], was proposed for solving global optimization problems. The main inspiration of HHO is the cooperative behavior and chasing style of Harris’ hawks in nature, called the surprise pounce. HHO has already proved its effectiveness and competitiveness in solving complex problems. In this paper, a novel approach based on chaotic sequences and HHO (CHHO) is suggested for solving the data clustering problem. In short, the novelty and major contributions are given below:

  • A chaotic Harris hawks optimizer (CHHO) is proposed for data clustering for the first time.

  • Twelve standard benchmark datasets have been utilized to evaluate the performance of the suggested approach.

  • Six well-known, recently developed nature-inspired algorithms have been considered for performance comparison against the suggested approach.

  • To prove the efficacy of the suggested approach statistically, three statistical tests have been performed.

The rest of the paper is organized as follows: Section 2 describes the basic idea of clustering and the Harris hawks optimizer. In Sect. 3, the proposed approach is described in detail. A description of the datasets used and the experimental setup is given in Sect. 4. Analysis of the experimental results is given in Sect. 5. Finally, Sect. 6 concludes the paper and highlights some future research directions.

2 Preliminaries

2.1 Clustering

Clustering is the process of classifying N data objects into K clusters in such a way that the sum of the intra-cluster distances is minimized and the sum of the inter-cluster distances is maximized [15]. Mathematically, a dataset with N data objects is represented as \(D = \{D_1, D_2,\ldots , D_N\}^{\mathrm{T}}\), where \(D_i\) = \(\{d_i^1, d_i^2, \ldots , d_i^f\}\). Here, f is the number of features or attributes (also called the dimensionality) of a data object and \(d_i^j\) is the jth feature of \(D_i\). Hence, a dataset can be represented as a matrix of size \(N\times f\) as follows:

$$\begin{aligned} D = [d_i^j], \quad 1\le i\le N \quad {\text{and}}\quad 1\le j\le f. \end{aligned}$$
(1)

In fact, the objective of clustering is to partition dataset D into K clusters, \(C_1, C_2, \ldots , C_K\), where the data objects within a cluster should be as similar as possible; however, the data objects of different clusters should be as distinct as possible. The fitness function for calculating the sum of the intra-cluster distances is given below:

$$\begin{aligned} F(D,Z) = \sum _{i=1}^N\sum _{k=1}^Kx_{ik}||(D_i-Z_k)||^2 \end{aligned}$$
(2)

where \(F(D,Z)\) is the sum of the intra-cluster distances, also called the fitness value, that needs to be minimized, and \(||(D_i-Z_k)||\) is the Euclidean distance between a data object \(D_i\) and the cluster center \(Z_k\). \(x_{ik}\) is the association weight of data object \(D_i\) with cluster k, which is 1 if data object i is assigned to cluster k and 0 otherwise.
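As an illustration, the following minimal NumPy sketch evaluates Eq. 2 under the usual hard nearest-centroid assignment for \(x_{ik}\); the function name and array layout are illustrative assumptions rather than part of the original formulation.

```python
import numpy as np

def clustering_fitness(data, centroids):
    """Sum of squared intra-cluster distances (Eq. 2).

    data      : (N, f) array of data objects D_i
    centroids : (K, f) array of cluster centers Z_k
    Each object is assigned to its nearest centroid, i.e., x_ik = 1 for the
    closest center and 0 otherwise (hard assignment assumed).
    """
    # squared Euclidean distance of every object to every centroid: shape (N, K)
    d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # each object contributes the distance to its closest centroid
    return d2.min(axis=1).sum()
```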

2.2 Harris hawks optimization (HHO)

HHO is a swarm intelligence-based technique for solving complex problems. The algorithm is inspired by the cooperative behavior and chasing style of Harris’ hawks in nature, called the surprise pounce. Any optimization algorithm tries to capture the globally optimal solution by implementing the mechanisms of exploration and exploitation. These optimization algorithms start their execution with a random initialization of the candidate solutions. In HHO, the Harris’ hawks are the candidate solutions, and the best candidate solution in each step is considered the intended prey or near-optimum. Based on the escaping energy of the prey, HHO decides whether the candidate solutions (Harris’ hawks) explore or exploit the search domain. The escaping energy (E) depends on the current iteration (t), the maximum number of iterations (T), and the initial energy \((E_0)\) of the prey, and is calculated as follows:

$$\begin{aligned} E=2E_0 \left( 1-\frac{t}{T}\right) \end{aligned}$$
(3)

Here \(-1<E_0<1\). If \(|E|\ge 1\), HHO enters the exploration phase, in which the hawks search different regions to target the rabbit location (the location of the intended prey). On the other hand, if \(|E|<1\), HHO performs the exploitation phase, in which the hawks search in the neighborhood of their current locations.
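A minimal sketch of this switching rule, assuming \(E_0\) is redrawn uniformly from \((-1, 1)\) for each hawk at every iteration (a common choice, assumed here rather than stated above):

```python
import numpy as np

rng = np.random.default_rng()

def escaping_energy(t, T):
    """Escaping energy E of the prey (Eq. 3); E0 is drawn uniformly from (-1, 1)."""
    E0 = rng.uniform(-1.0, 1.0)
    return 2.0 * E0 * (1.0 - t / T)

# E = escaping_energy(t, T)
# if abs(E) >= 1: exploration phase (Eq. 4)
# else:           exploitation phase (Eqs. 5-10)
```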

2.2.1 Exploration phase

In this phase, candidate solutions update their position vectors as follows:

$$\begin{aligned} X_{{t + 1}} = \left\{ {\begin{array}{ll} {X_{\mathrm{rand}} \left( t \right) - r_{1} \left| {X_{\mathrm{rand}} \left( t \right) - 2r_{2} X\left( t \right) } \right| ,} &{} \quad {q \ge 0.5} \\ {\left( {X_{\mathrm{rabbit}} \left( t \right) - X_{m} \left( t \right) } \right) - r_{3} \left( {lb + r_{4} \left( {ub - lb} \right) } \right) ,} &{} \quad {q < 0.5} \\ \end{array} } \right. \end{aligned}$$
(4)

where \(X_{t+1}\) and \(X_t\) are the updated and current position vectors of the hawks, \(X_{\mathrm{rabbit}}\) is the position vector of the rabbit, and q, \(r_1, r_2, r_3\), and \(r_4\) are random numbers in the range [0, 1]. \(X_m(t)\) is the average position of the current population of hawks, lb and ub are the lower and upper bounds of the variables, and \(X_{\mathrm{rand}}(t)\) is a randomly selected solution from the current population.
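A sketch of the exploration update for a single hawk, assuming all vectors are flattened to one dimension; the function name is illustrative.

```python
import numpy as np

def exploration_step(X, X_rand, X_rabbit, X_mean, lb, ub, rng=np.random.default_rng()):
    """Exploration update of one hawk (Eq. 4)."""
    q, r1, r2, r3, r4 = rng.random(5)
    if q >= 0.5:
        # perch relative to a randomly chosen hawk of the population
        return X_rand - r1 * np.abs(X_rand - 2.0 * r2 * X)
    # perch based on the rabbit position and the population mean
    return (X_rabbit - X_mean) - r3 * (lb + r4 * (ub - lb))
```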

2.2.2 Exploitation phase

In this phase, four strategies are suggested to model the attacking stage. Let \(r < 0.5\) represent the case in which the prey successfully escapes and \(r \ge 0.5\) the case in which it fails to escape before the surprise pounce. On the other hand, the conditions for soft besiege and hard besiege are \(|E| \ge 0.5\) and \(|E| < 0.5\), respectively. Considering the values of r and |E|, the four possibilities used in HHO are shown in Table 1. In the soft besiege strategy, the prey has enough energy to escape by random misleading jumps. During these jumps, the hawks encircle the prey softly and then perform the surprise pounce. This strategy is modeled using Eq. 5:

Table 1 Four strategies adopted in HHO based on escaping energy E and values of r in exploitation phase
$$\begin{aligned} X_{t+1}=X_{\mathrm{rabbit}}(t)-X(t)-E\left| JX_{\mathrm{rabbit}}(t)-X(t) \right| \end{aligned}$$
(5)

where \(J = 2(1 - r)\) is the random jump strength of the prey during the escaping procedure.

In the hard besiege strategy, the prey is exhausted due to its low escaping energy. In this case, the hawks encircle the prey tightly and perform the surprise pounce using Eq. 6:

$$\begin{aligned} X_{t+1}=X_{\mathrm{rabbit}}(t)-E\left| X_{\mathrm{rabbit}}(t)-X(t) \right| . \end{aligned}$$
(6)
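The two basic besiege moves can be sketched as follows (illustrative helper names; r is the escape-chance random number defined above):

```python
import numpy as np

def soft_besiege(X, X_rabbit, E, r):
    """Soft besiege (Eq. 5): applied when |E| >= 0.5 and r >= 0.5."""
    J = 2.0 * (1.0 - r)                     # random jump strength of the prey
    return (X_rabbit - X) - E * np.abs(J * X_rabbit - X)

def hard_besiege(X, X_rabbit, E):
    """Hard besiege (Eq. 6): applied when |E| < 0.5 and r >= 0.5."""
    return X_rabbit - E * np.abs(X_rabbit - X)
```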

In the soft besiege with progressive rapid dives strategy, the hawks decide their next move according to Eq. 7. This move is compared with the previous dive to check whether it is better. If the move is not beneficial, the hawks perform a Lévy flight-guided movement toward the prey using Eq. 8. In this equation, D is the number of decision variables in a candidate solution, also called the dimension of the problem, LF is the Lévy flight function [33], and S is a random vector of size \({1 \times D}\):

$$\begin{aligned} Y= & {} X_{\mathrm{rabbit}}(t)-E\left| JX_{\mathrm{rabbit}}(t)-X(t) \right| \end{aligned}$$
(7)
$$\begin{aligned} Z= Y+S\times {\text{LF}}(D). \end{aligned}$$
(8)

The fitness values of Y and Z are compared against the current fitness of the hawk, and the best of these positions is retained. The hard besiege counterpart of this strategy uses Eqs. 9 and 10:

$$\begin{aligned} Y= X_{\mathrm{rabbit}}(t)-E\left| JX_{\mathrm{rabbit}}(t)-X_m(t) \right| \end{aligned}$$
(9)
$$\begin{aligned} Z= Y+S\times {\text{LF}}(D). \end{aligned}$$
(10)

In the hard besiege with progressive rapid dives strategy, the hawks decide their next move according to Eq. 9. However, if the prey performs more deceptive motions, the hawks update their positions according to Eq. 10. The final move of the hawks is decided based on the fitness of Y and Z given in Eqs. 9 and 10. At the end of the execution, HHO returns the location of the prey together with its fitness.
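A sketch of the soft besiege with progressive rapid dives (Eqs. 7 and 8); the Lévy flight follows Mantegna's algorithm with \(\beta = 1.5\), a common choice assumed here rather than taken from the text. The hard besiege variant (Eqs. 9 and 10) is identical except that the population mean \(X_m(t)\) replaces \(X(t)\) inside the absolute value.

```python
import numpy as np
from math import gamma, sin, pi

def levy_flight(dim, beta=1.5, rng=np.random.default_rng()):
    """Lévy flight step LF(D) used in Eqs. 8 and 10 (Mantegna's algorithm)."""
    sigma = (gamma(1 + beta) * sin(pi * beta / 2) /
             (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma, dim)
    v = rng.normal(0.0, 1.0, dim)
    return u / np.abs(v) ** (1 / beta)

def soft_besiege_dives(X, X_rabbit, E, r, fitness, rng=np.random.default_rng()):
    """Soft besiege with progressive rapid dives (|E| >= 0.5, r < 0.5)."""
    J = 2.0 * (1.0 - r)
    Y = X_rabbit - E * np.abs(J * X_rabbit - X)      # Eq. 7
    S = rng.random(X.size)                           # random vector (chaotic in CHHO)
    Z = Y + S * levy_flight(X.size, rng=rng)         # Eq. 8
    # keep whichever dive improves on the current position, otherwise stay put
    better = [c for c in (Y, Z) if fitness(c) < fitness(X)]
    return min(better, key=fitness) if better else X
```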

3 Proposed approach

The proposed approach starts with a random initialization of the candidate solutions within the boundary range:

$$\begin{aligned} S_p = \mathrm{lb} + r \times (\mathrm{ub} - \mathrm{lb}) \end{aligned}$$
(11)

where \(S_p\) is the pth \((1\le p\le P)\) solution in the population and \(r \in (0, 1)\) is a random number. Note that each solution is of size \(K \times f\), where K and f are the number of clusters and the number of features, respectively, of the dataset under consideration. lb and ub are the lower and upper bounds of the given dataset. The population is the collection of solutions and is represented as given in Eq. 12:

$$\begin{aligned} P = [S_1, S_2, \ldots , S_P]^T \end{aligned}$$
(12)

where P is the population. After the random initialization of the population, each candidate solution is evaluated using Eq. 2. The best objective value and its corresponding position vector are stored for future use.
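A minimal sketch of this initialization, assuming each solution is stored as a flattened vector of K centroids and the bounds are taken per feature from the dataset; the names are illustrative.

```python
import numpy as np

def init_population(P, K, f, lb, ub, rng=np.random.default_rng()):
    """Random population of P candidate solutions (Eqs. 11 and 12).

    lb, ub : per-feature lower and upper bounds of the dataset (length f).
    Each row encodes K centroids with f features, flattened to length K*f.
    """
    lb = np.tile(np.asarray(lb, dtype=float), K)   # repeat the bounds for each centroid
    ub = np.tile(np.asarray(ub, dtype=float), K)
    return lb + rng.random((P, K * f)) * (ub - lb)
```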

The use of chaotic sequences in optimization algorithms reduces the chance of stagnation owing to their sensitivity to initial conditions, stochasticity, and ergodicity [34, 35]. To avoid local entrapment in HHO, which might occur due to the large search domain and the nonlinear objective function, we employ chaotic sequences generated by the logistic chaotic map. The logistic map [36] is defined as follows:

$$\begin{aligned} x_{t+1}=cx_t(1-x_t) \end{aligned}$$
(13)

where \(c = 4\). The initial value of \(x_t\) is chosen randomly in the range 0.65 to 0.85. The reason behind the inclusion of chaotic sequences is to improve the global search capability of HHO. Note that \(r_2\) and \(r_3\) used in Eq. 4 are replaced by the chaotic sequence given in Eq. 13. Likewise, in the exploitation phase, the random vector S used in Eqs. 8 and 10 is replaced with chaotic sequences created by the logistic map of Eq. 13.
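A small sketch of the chaotic sequence generator and where its values are substituted; the function name is illustrative.

```python
def logistic_map(x0, length, c=4.0):
    """Chaotic sequence from the logistic map (Eq. 13); x0 is drawn from (0.65, 0.85)."""
    seq, x = [], x0
    for _ in range(length):
        x = c * x * (1.0 - x)
        seq.append(x)
    return seq

# In CHHO, successive values of this sequence replace r2 and r3 in Eq. 4
# (exploration) and the entries of the random vector S in Eqs. 8 and 10
# (exploitation with rapid dives).
```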

After performing the exploration and exploitation phases, the updated position vectors of the candidate solutions are obtained; they form the offspring population. The quality of the offspring solutions is compared, based on their objective values, with that of their parent population, and the best P solutions are selected from the combined population of parents and offspring. This process continues until the termination condition is met. The steps involved in the proposed approach are given in Algorithm 1 and sketched in the code below.
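A high-level sketch of the CHHO loop under the assumptions above; it reuses the `clustering_fitness` and `init_population` helpers sketched earlier, and `chho_update` stands in for the chaotic exploration/exploitation rules (a hypothetical placeholder, not an API defined in the paper).

```python
import numpy as np

def chho(data, K, pop_size=100, max_iter=500, rng=np.random.default_rng()):
    """Sketch of chaotic HHO for clustering; returns the best centroids found."""
    N, f = data.shape
    lb, ub = data.min(axis=0), data.max(axis=0)
    pop = init_population(pop_size, K, f, lb, ub, rng)                      # Eq. 11
    fit = np.array([clustering_fitness(data, s.reshape(K, f)) for s in pop])
    best = pop[fit.argmin()].copy()
    for t in range(max_iter):
        # offspring produced by the chaotic exploration/exploitation rules
        off = np.array([chho_update(s, pop, best, t, max_iter, lb, ub) for s in pop])
        off_fit = np.array([clustering_fitness(data, s.reshape(K, f)) for s in off])
        # elitist survivor selection: keep the best pop_size of parents + offspring
        merged, merged_fit = np.vstack([pop, off]), np.concatenate([fit, off_fit])
        keep = np.argsort(merged_fit)[:pop_size]
        pop, fit = merged[keep], merged_fit[keep]
        best = pop[fit.argmin()].copy()
    return best.reshape(K, f), fit.min()
```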

4 Datasets and experimental setup

To compare the performance of the proposed approach, six recently developed algorithms, namely Harris hawks optimization (HHO) [32], gray wolf optimizer (GWO) [11], butterfly optimization algorithm (BOA) [37], multi-verse optimizer (MVO) [38], salp swarm algorithm (SSA) [39], and sine cosine algorithm (SCA) [40], have been considered. The parameters of HHO, GWO, BOA, MVO, SSA, and SCA are set according to their corresponding references [11, 32, 37,38,39,40], respectively. The common parameter settings, such as the population size, the maximum number of iterations, and the number of independent runs, are the same for all algorithms and are given below:

  • population size = 100

  • maximum number of iterations = 500

  • number of independent runs = 40.

In this study, eight shape datasets and four UCI datasets are considered for the performance evaluation of the algorithms. These datasets are downloaded from https://cs.joensuu.fi/sipu/datasets/. The descriptions of the shape and UCI datasets are given in Tables 2 and 3, respectively. All algorithms have been implemented in MATLAB R2017a on a machine with a 64-bit Windows 7 operating system, 4 GB of RAM, and a Core-i5 processor.

Algorithm 1 Steps of the proposed CHHO approach
Table 2 Shape datasets
Table 3 UCI datasets

5 Analysis of experimental results

This section analyzes the experimental results based on various parameters. The performance of the algorithms is compared based on the experimental values of the sum of intra-cluster distances. The best, worst, mean, and standard deviation of the sum of intra-cluster distances have been evaluated for each algorithm and dataset. These values are given in Tables 4 and 5 for the shape and UCI datasets, respectively. The tables show that the best values of the sum of intra-cluster distances obtained by the proposed approach are the smallest on most of the datasets considered in this study. This indicates that, under the given experimental setup, CHHO captures more appropriate points in the search domain than the rest of the algorithms. On the other hand, observing the worst values of the sum of intra-cluster distances on the different datasets, the values obtained by CHHO are either better than or comparable to those of the rest of the algorithms. The mean in Tables 4 and 5 is the average of the sum of intra-cluster distances. Careful observation shows a significant difference in the mean values of the algorithms on different datasets. The reason is that the datasets under consideration are complex, and the algorithms have difficulty converging toward the optimal points. Large values of standard deviation (SD) are also observed. The reason is that there is a significant difference between the mean value of the sum of the intra-cluster distances and the values obtained in the individual runs.

Table 4 Experimental results of algorithms for shape datasets
Table 5 Experimental results of algorithms for UCI datasets

Tables 6 and 7 show the ranking of the algorithms based on the mean value of the sum of the intra-cluster distances for the shape and UCI datasets, respectively. In both cases, the average rank of CHHO is the best. To show the convergence rate of the algorithms graphically, convergence curves are drawn for the particular run in which the best value of the sum of intra-cluster distances is obtained. These convergence curves are shown in Figs. 1, 2, 3, 4, 5, and 6. CHHO shows efficacious behavior on most of the datasets.

Table 6 Ranking of algorithms for shape dataset
Table 7 Ranking of algorithms for UCI dataset
Fig. 1

Convergence curves of the algorithms for a Flame dataset, b Jain dataset

Fig. 2

Convergence curves of the algorithms for a R15 dataset, b D31 dataset

Fig. 3

Convergence curves of the algorithms for a Aggregation dataset, b Compound dataset

Fig. 4

Convergence curves of the algorithms for a Path-based dataset, b Spiral dataset

Fig. 5

Convergence curves of the algorithms for a Glass dataset, b Iris dataset

Fig. 6

Convergence curves of the algorithms for a Wine dataset, b Yeast dataset

In order to determine whether there are significant differences among the experimental results, statistical tests have been performed. We employed the Friedman test and the Iman–Davenport test to determine whether there are significant differences in the experimental results of the algorithms considered in this study. In general, the rejection of the null hypothesis in the Friedman and Iman–Davenport tests indicates that there are statistically significant differences in the performances of the algorithms. In such a case, a post hoc test is desirable, which compares the control algorithm (the best performing algorithm) against the remaining ones. Here, in all cases, we have used \(\alpha = 0.05\) as the level of confidence. Table 8 shows the statistical value and p value of the Friedman and Iman–Davenport tests for the shape datasets; both tests reject the null hypothesis. This confirms that there are statistically significant differences in the performances of the algorithms. Now, to check whether there is a significant difference in performance between CHHO (the best performing algorithm) and the rest of the algorithms, we have performed the Holm test as a post hoc test.
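A minimal sketch of this test sequence in Python; the score matrix and pairwise p values below are placeholders (not the paper's results), and the Iman–Davenport statistic is computed from the Friedman statistic with the usual correction formula.

```python
import numpy as np
from scipy.stats import friedmanchisquare
from statsmodels.stats.multitest import multipletests

# mean intra-cluster distance of each algorithm on each dataset
# (placeholder values: 8 shape datasets x 7 algorithms)
scores = np.random.rand(8, 7)

# Friedman test across the 7 algorithms
chi2, p = friedmanchisquare(*[scores[:, j] for j in range(scores.shape[1])])

# Iman-Davenport correction of the Friedman statistic
n, k = scores.shape
f_id = (n - 1) * chi2 / (n * (k - 1) - chi2)

# Holm post hoc correction of the pairwise p values of the control algorithm
# (CHHO) against the remaining six algorithms; p_raw is a placeholder
p_raw = np.random.rand(6)
reject, p_holm, _, _ = multipletests(p_raw, alpha=0.05, method="holm")
```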

Table 8 Results of Friedman’s and Iman–Davenport’s tests for shape datasets based on the sum of intra-cluster distances

Table 9 shows the experimental values of the Holm test for the shape datasets. The results of the Holm test indicate that CHHO is statistically better than MVO, SSA, BOA, and SCA regarding the sum of the intra-cluster distances. CHHO is not statistically better than GWO and HHO based on the Holm test results. However, the results reported in Table 4 show that CHHO outperforms HHO on six out of eight datasets. On the other hand, CHHO outperforms GWO on all shape datasets. Table 10 shows the results of the Friedman and Iman–Davenport tests for the UCI datasets. Both tests reject the null hypothesis and confirm the existence of significant differences in the performances of the algorithms. Further, the Holm test is performed for the UCI datasets, and the results are shown in Table 11. This table indicates that CHHO is statistically better than MVO, SSA, and SCA in terms of the sum of the intra-cluster distances. CHHO is not statistically better than BOA, GWO, and HHO based on the Holm test results. However, the rankings in Table 7 justify the efficacy of CHHO on all UCI datasets.

Table 9 Results of the Holm’s test for shape datasets based on the sum of intra-cluster distances (CHHO is the control algorithm)
Table 10 Results of Friedman’s and Iman–Davenport’s tests for UCI datasets based on the sum of intra-cluster distances
Table 11 Results of the Holm’s test for UCI datasets based on the sum of intra-cluster distances (CHHO is the control algorithm)

Tables 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, and 23 show the best centroids obtained by CHHO for the datasets used in this paper. The reason for presenting the best centroids is to validate the sums of intra-cluster distances reported in Tables 4 and 5. In this study, the maximum number of iterations was set to 500, and each algorithm was run 40 times to calculate the best, worst, mean, and standard deviation values. At the end of the program execution, 40 best values of the sum of the intra-cluster distances were obtained for each algorithm and each dataset. To compare these values, we have plotted a box plot for each dataset, shown in Figs. 7, 8, 9, 10, 11, and 12. On each box, the middle horizontal line indicates the median of the best values of the sum of the intra-cluster distances. These figures show that the best values of the sum of intra-cluster distances obtained by CHHO are better than those of the rest of the algorithms.
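A short matplotlib sketch of how such a box plot can be produced; the 40 values per algorithm below are random placeholders, not the paper's results.

```python
import numpy as np
import matplotlib.pyplot as plt

# best fitness of the 40 independent runs of each algorithm on one dataset
# (random placeholders; the paper's values underlie Tables 4 and 5)
algorithms = ["CHHO", "HHO", "GWO", "BOA", "MVO", "SSA", "SCA"]
runs = [np.random.rand(40) * 100 for _ in algorithms]

plt.boxplot(runs, labels=algorithms)      # middle line of each box is the median
plt.ylabel("Sum of intra-cluster distances")
plt.title("Distribution of the best values over 40 runs")
plt.show()
```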

Table 12 The best centroids obtained by CHHO algorithm for Flame dataset
Table 13 The best centroids obtained by CHHO algorithm for Jain dataset
Table 14 The best centroids obtained by CHHO algorithm for R15 dataset
Table 15 The best centroids obtained by CHHO algorithm for D31 dataset
Table 16 The best centroids obtained by CHHO algorithm for Aggregation dataset
Table 17 The best centroids obtained by CHHO algorithm for Compound dataset
Table 18 The best centroids obtained by CHHO algorithm for Path-based dataset
Table 19 The best centroids obtained by CHHO algorithm for Spiral dataset
Table 20 The best centroids obtained by CHHO algorithm for Glass dataset
Table 21 The best centroids obtained by CHHO algorithm for Iris dataset
Table 22 The best centroids obtained by CHHO algorithm for Wine dataset
Table 23 The best centroids obtained by CHHO algorithm for Yeast dataset
Fig. 7

Box plot of the algorithms for a Flame dataset, b Jain dataset

Fig. 8

Box plot of the algorithms for a R15 dataset, b D31 dataset

Fig. 9

Box plot of the algorithms for a Aggregation dataset, b Compound dataset

Fig. 10

Box plot of the algorithms for a Path-based dataset, b Spiral dataset

Fig. 11

Box plot of the algorithms for a Glass dataset, b Iris dataset

Fig. 12

Box plot of the algorithms for a Wine dataset, b Yeast dataset

The detailed discussion of the experimental results has justified the efficacy of the suggested approach in many ways. HHO had already proved its competitive behavior in solving global optimization problems. The inclusion of chaotic sequences makes HHO more powerful by reducing the possibility of local entrapment while solving the data clustering problem; the replacement of random numbers by chaotic numbers improves the global search ability of CHHO. On the other hand, MVO shows the worst performance on all datasets used, which points to local entrapment during the optimization process. The performance of the proposed approach is quite satisfactory in solving the data clustering problem. However, it decreases as the dimension of the problem increases, and the proposed algorithm has difficulty converging toward the optimal points when the search domain is very large. In such cases, the HHO algorithm in conjunction with a suitable distribution (such as Cauchy or Gaussian) could be more effective in achieving the desired goal. In addition, HHO in conjunction with an opposition-based learning approach could also be effective in solving optimization problems when the dimension and the search domain are very large.

6 Conclusions and future research direction

In this paper, a chaotic sequence-guided HHO algorithm is suggested for data clustering for the first time. The objective is to find the optimal position vectors of the centroids that best represent the clusters. The performance of the proposed approach is compared against six recent state-of-the-art algorithms on 12 benchmark datasets. The suggested approach outperformed the other algorithms in most cases. In the future, we would like to apply the proposed approach to other real-world applications. A multiobjective version of the proposed approach can also be developed and applied to handle multiple conflicting objectives simultaneously.