1 Introduction

Rapid development in different areas of science and technology requires the storage of huge amounts of data in databases. These datasets are often very complex and involve many parameters. Such a large volume of data does not provide any useful information without processing. Since manual processing of these datasets is beyond human capacity, people have turned toward computing technologies to automate models that serve the desired purpose [1]. Data mining is one of the most important techniques for extracting useful information from large-scale datasets. It basically involves three steps to reach the desired objective(s). The first step prepares the data and is called data scrubbing. In the second step, a suitable data mining algorithm is selected. In the last step, the data are analyzed. Among these, the selection of a suitable data mining algorithm has the strongest influence on the overall performance of the model. In addition, data mining offers various methods, such as summarization, association, and data clustering, to discover hidden patterns in large databases for practical applications. Among these, data clustering is one of the most popular methods and has gained growing research interest in the last few decades.

The objective of clustering is to partition N data objects into K clusters. The data objects in a cluster are highly similar to one another; however, they are highly dissimilar to the data objects of other clusters [2]. Clustering is widely used in different domains of science and engineering such as community detection [3], web page recommendation [4], text mining [5], image segmentation [6], stock market prediction [7], and many more. Clustering is an optimization problem, and various optimization algorithms have been suggested recently to achieve the desired goal. Broadly speaking, these algorithms are called nature-inspired algorithms because they exploit natural phenomena during the optimization process. Nature-inspired algorithms are broadly classified into two subclasses: evolutionary algorithms [8, 9] and swarm intelligence-based algorithms [10, 11].

Various heuristic approaches based on evolutionary algorithms have been proposed for data clustering in the last few decades [12]. Maulik et al. [13] proposed a clustering approach based on differential evolution (DE) for image classification and justified the efficacy of their proposal. Das et al. [14] suggested a DE-based approach for clustering the pixels of an image in the gray-scale intensity space. A survey on clustering based on nature-inspired metaheuristic algorithms is given in [15]. Recently, various swarm intelligence-based algorithms have been developed and later applied to data clustering problems. An artificial bee colony (ABC) algorithm-guided data clustering approach has been suggested in [16, 17]. An extensive survey of data clustering approaches based on particle swarm optimization (PSO) is presented in [18, 19]. A gray wolf optimizer-based clustering algorithm is proposed in [20]. In the last few decades, chaotic sequences created by chaotic maps have been used in optimization algorithms to improve their performance [21]. In this context, Chuang et al. [22] proposed an approach based on a chaotic map and PSO for solving the data clustering problem. Li et al. [23] proposed a clustering algorithm based on a chaotic PSO and a gradient method. Wan et al. [24] suggested a chaotic ant swarm approach for data clustering.

Jamshidi et al. [25] proposed an adaptive neuro-fuzzy inference system to identify the dynamic behavior of a lithium-ion battery. The authors justified the effectiveness of their approach in minimizing errors while handling systems with multiple nonlinear behaviors and uncertainties. An adaptive neuro-fuzzy inference system based on a subtractive clustering algorithm is proposed in [26] for identifying the remaining useful life of electrolytic capacitors. A new multiobjective approach for detecting money laundering is presented in [27]. Jamshidi et al. [28] proposed a neuro-fuzzy-guided technique to model a Li-ion battery used in a small NASA satellite and claimed the competitiveness of their approach in achieving the desired goal. In addition, a variety of optimization problems are being solved using appropriate optimization algorithms [29,30,31].

Recently, a swarm intelligence-based algorithm, called Harris hawks optimization (HHO) [32], was proposed for solving global optimization problems. The main inspiration of HHO is the cooperative behavior and chasing style of Harris’ hawks in nature, called the surprise pounce. HHO has already proved its effectiveness and competitiveness in solving complex problems. In this paper, a novel approach based on chaotic sequences and HHO (CHHO) is suggested for solving the data clustering problem. In short, the novelty and major contributions are given below:

  • A chaotic Harris hawks optimizer (CHHO) is proposed for data clustering for the first time.

  • Twelve standard benchmark datasets have been utilized to evaluate the performance of the suggested approach.

  • Six well-known, recently developed nature-inspired algorithms have been considered for performance comparison against the suggested approach.

  • To prove the efficacy of the suggested approach statistically, three statistical tests have been performed.

The rest of the paper is organized as follows: Section 2 describes the basic idea of clustering and the Harris hawks optimizer. In Sect. 3, the proposed approach is described in detail. A description of the datasets used and the experimental setup is given in Sect. 4. Analysis of the experimental results is given in Sect. 5. Finally, Sect. 6 concludes the paper and highlights some future research directions.

2 Preliminaries

2.1 Clustering

Clustering is the process of classifying N data objects into K clusters in such a way that the sum of the intra-cluster distances is minimized and the sum of the inter-cluster distances is maximized [15]. Mathematically, a dataset with N data objects is represented as \(D = \{D_1, D_2,\ldots , D_N\}^{\mathrm{T}}\), where \(D_i\) = \(\{d_i^1, d_i^2, \ldots , d_i^f\}\). Here, f is the number of features or attributes (also called the dimensionality) of a data object and \(d_i^j\) is the jth feature of \(D_i\). Hence, a dataset can be represented as a matrix of size \(N\times f\) as follows:

$$\begin{aligned} D = [d_i^j], \quad 1\le i\le N \quad {\text{and}}\quad 1\le j\le f. \end{aligned}$$
(1)

In fact, the objective of clustering is to partition dataset D into K clusters, \(C_1, C_2, \ldots , C_K\), where the data objects within a cluster should be as similar as possible; however, the data objects of different clusters should be as distinct as possible. The fitness function for calculating the sum of the intra-cluster distances is given below:

$$\begin{aligned} F(D,Z) = \sum _{i=1}^N\sum _{k=1}^Kx_{ik}||(D_i-Z_k)||^2 \end{aligned}$$
(2)

where \(F(D,Z)\) is the sum of the intra-cluster distances, also called the fitness value, that needs to be minimized, and \(||(D_i-Z_k)||\) is the Euclidean distance between a data object \(D_i\) and the cluster center \(Z_k\). \(x_{ik}\) is the association weight of data object \(D_i\) with cluster k, which is 1 if data object i is assigned to cluster k and 0 otherwise.
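As an illustration, the following minimal NumPy sketch evaluates Eq. 2 under the usual hard nearest-centroid assignment for \(x_{ik}\); the function name and array layout are illustrative assumptions rather than part of the original formulation.

```python
import numpy as np

def clustering_fitness(data, centroids):
    """Sum of squared intra-cluster distances (Eq. 2).

    data      : (N, f) array of data objects D_i
    centroids : (K, f) array of cluster centers Z_k
    Each object is assigned to its nearest centroid, i.e., x_ik = 1 for the
    closest center and 0 otherwise (hard assignment assumed).
    """
    # squared Euclidean distance of every object to every centroid: shape (N, K)
    d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # each object contributes the distance to its closest centroid
    return d2.min(axis=1).sum()
```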

2.2 Harris hawks optimization (HHO)

HHO is a swarm intelligence-based technique for solving complex problems. The algorithm is inspired by the cooperative behavior and chasing style of Harris’ hawks in nature, called the surprise pounce. Any optimization algorithm tries to capture the globally optimal solution by implementing the mechanisms of exploration and exploitation. These optimization algorithms start their execution with a random initialization of the candidate solutions. In HHO, the Harris’ hawks are the candidate solutions, and the best candidate solution in each step is considered the intended prey or near-optimum. Based on the escaping energy of the prey, HHO decides whether the candidate solutions (Harris’ hawks) explore or exploit the search domain. The escaping energy (E) depends on the current iteration (t), the maximum number of iterations (T), and the initial energy \((E_0)\) of the prey, and is calculated as follows:

$$\begin{aligned} E=2E_0 \left( 1-\frac{t}{T}\right) \end{aligned}$$
(3)

Here \(-1<E_0<1\). If \(|E|\ge 1\), HHO enters the exploration phase, in which the hawks search different regions to target the rabbit location (the location of the intended prey). On the other hand, if \(|E|<1\), HHO performs the exploitation phase, in which the hawks search in the neighborhood of their current locations.
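A minimal sketch of this switching rule, assuming \(E_0\) is redrawn uniformly from \((-1, 1)\) for each hawk at every iteration (a common choice, assumed here rather than stated above):

```python
import numpy as np

rng = np.random.default_rng()

def escaping_energy(t, T):
    """Escaping energy E of the prey (Eq. 3); E0 is drawn uniformly from (-1, 1)."""
    E0 = rng.uniform(-1.0, 1.0)
    return 2.0 * E0 * (1.0 - t / T)

# E = escaping_energy(t, T)
# if abs(E) >= 1: exploration phase (Eq. 4)
# else:           exploitation phase (Eqs. 5-10)
```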

2.2.1 Exploration phase

In this phase, candidate solutions update their position vectors as follows:

$$\begin{aligned} X_{{t + 1}} = \left\{ {\begin{array}{ll} {X_{\mathrm{rand}} \left( t \right) - r_{1} \left| {X_{\mathrm{rand}} \left( t \right) - 2r_{2} X\left( t \right) } \right| ,} &{} \quad {q \ge 0.5} \\ {\left( {X_{\mathrm{rabbit}} \left( t \right) - X_{m} \left( t \right) } \right) - r_{3} \left( {lb + r_{4} \left( {ub - lb} \right) } \right) ,} &{} \quad {q < 0.5} \\ \end{array} } \right. \end{aligned}$$
(4)

where \(X_{t+1}\) and \(X_t\) are the updated and current position vectors of the hawks, \(X_{\mathrm{rabbit}}\) is the position vector of the rabbit, and q, \(r_1, r_2, r_3\), and \(r_4\) are random numbers in the range [0, 1]. \(X_m(t)\) is the average position of the current population of hawks, lb and ub are the lower and upper bounds of the variables, and \(X_{\mathrm{rand}}(t)\) is a randomly selected solution from the current population.
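A sketch of the exploration update for a single hawk, assuming all vectors are flattened to one dimension; the function name is illustrative.

```python
import numpy as np

def exploration_step(X, X_rand, X_rabbit, X_mean, lb, ub, rng=np.random.default_rng()):
    """Exploration update of one hawk (Eq. 4)."""
    q, r1, r2, r3, r4 = rng.random(5)
    if q >= 0.5:
        # perch relative to a randomly chosen hawk of the population
        return X_rand - r1 * np.abs(X_rand - 2.0 * r2 * X)
    # perch based on the rabbit position and the population mean
    return (X_rabbit - X_mean) - r3 * (lb + r4 * (ub - lb))
```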

2.2.2 Exploitation phase

In this phase, four strategies are suggested to model the attacking stage. Let \(r < 0.5\) represent the case in which the prey successfully escapes and \(r \ge 0.5\) the case in which it fails to escape before the surprise pounce. On the other hand, the conditions for soft besiege and hard besiege are \(|E| \ge 0.5\) and \(|E| < 0.5\), respectively. Considering the values of r and |E|, the four possibilities used in HHO are shown in Table 1. In the soft besiege strategy, the prey has enough energy to escape by random misleading jumps. During these jumps, the hawks encircle the prey softly and then perform the surprise pounce. This strategy is modeled using Eq. 5:

Table 1 Four strategies adopted in HHO based on escaping energy E and values of r in exploitation phase
$$\begin{aligned} X_{t+1}=X_{\mathrm{rabbit}}(t)-X(t)-E\left| JX_{\mathrm{rabbit}}(t)-X(t) \right| \end{aligned}$$
(5)

where \(J = 2(1 - r)\) is the random jump strength of the prey during the escaping procedure.

In the hard besiege strategy, the prey is exhausted due to its low escaping energy. In this case, the hawks encircle the prey tightly and perform the surprise pounce using Eq. 6:

$$\begin{aligned} X_{t+1}=X_{\mathrm{rabbit}}(t)-E\left| X_{\mathrm{rabbit}}(t)-X(t) \right| . \end{aligned}$$
(6)
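The two basic besiege moves can be sketched as follows (illustrative helper names; r is the escape-chance random number defined above):

```python
import numpy as np

def soft_besiege(X, X_rabbit, E, r):
    """Soft besiege (Eq. 5): applied when |E| >= 0.5 and r >= 0.5."""
    J = 2.0 * (1.0 - r)                     # random jump strength of the prey
    return (X_rabbit - X) - E * np.abs(J * X_rabbit - X)

def hard_besiege(X, X_rabbit, E):
    """Hard besiege (Eq. 6): applied when |E| < 0.5 and r >= 0.5."""
    return X_rabbit - E * np.abs(X_rabbit - X)
```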

In the soft besiege with progressive rapid dives strategy, the hawks decide their next move according to Eq. 7. This move is compared with the previous dive to check whether it is better. If the move is not beneficial, the hawks perform a Lévy flight-guided movement toward the prey using Eq. 8. In this equation, D is the number of decision variables in a candidate solution, also called the dimension of the problem, LF is the Lévy flight function [33], and S is a random vector of size \({1 \times D}\):

$$\begin{aligned} Y= & {} X_{\mathrm{rabbit}}(t)-E\left| JX_{\mathrm{rabbit}}(t)-X(t) \right| \end{aligned}$$
(7)
$$\begin{aligned} Z= Y+S\times {\text{LF}}(D). \end{aligned}$$
(8)

The fitness values of Y and Z are compared against the current fitness of the hawk, and the best of these positions is retained. The hard besiege counterpart of this strategy uses Eqs. 9 and 10:

$$\begin{aligned} Y= X_{\mathrm{rabbit}}(t)-E\left| JX_{\mathrm{rabbit}}(t)-X_m(t) \right| \end{aligned}$$
(9)
$$\begin{aligned} Z= Y+S\times {\text{LF}}(D). \end{aligned}$$
(10)

In the hard besiege with progressive rapid dives strategy, the hawks decide their next move according to Eq. 9. However, if the prey performs more deceptive motions, the hawks update their positions according to Eq. 10. The final move of the hawks is decided based on the fitness of Y and Z given in Eqs. 9 and 10. At the end of the execution, HHO returns the location of the prey together with its fitness.
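A sketch of the soft besiege with progressive rapid dives (Eqs. 7 and 8); the Lévy flight follows Mantegna's algorithm with \(\beta = 1.5\), a common choice assumed here rather than taken from the text. The hard besiege variant (Eqs. 9 and 10) is identical except that the population mean \(X_m(t)\) replaces \(X(t)\) inside the absolute value.

```python
import numpy as np
from math import gamma, sin, pi

def levy_flight(dim, beta=1.5, rng=np.random.default_rng()):
    """Lévy flight step LF(D) used in Eqs. 8 and 10 (Mantegna's algorithm)."""
    sigma = (gamma(1 + beta) * sin(pi * beta / 2) /
             (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma, dim)
    v = rng.normal(0.0, 1.0, dim)
    return u / np.abs(v) ** (1 / beta)

def soft_besiege_dives(X, X_rabbit, E, r, fitness, rng=np.random.default_rng()):
    """Soft besiege with progressive rapid dives (|E| >= 0.5, r < 0.5)."""
    J = 2.0 * (1.0 - r)
    Y = X_rabbit - E * np.abs(J * X_rabbit - X)      # Eq. 7
    S = rng.random(X.size)                           # random vector (chaotic in CHHO)
    Z = Y + S * levy_flight(X.size, rng=rng)         # Eq. 8
    # keep whichever dive improves on the current position, otherwise stay put
    better = [c for c in (Y, Z) if fitness(c) < fitness(X)]
    return min(better, key=fitness) if better else X
```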

3 Proposed approach

The proposed approach starts with a random initialization of the candidate solutions within the boundary range:

$$\begin{aligned} S_p = \mathrm{lb} + r \times (\mathrm{ub} - \mathrm{lb}) \end{aligned}$$
(11)

where \(S_p\) is the pth \((1\le p\le P)\) solution in the population and \(r \in (0, 1)\) is a random number. Note that each solution is of size \(K \times f\), where K and f are the number of clusters and the number of features, respectively, of the dataset under consideration. lb and ub are the lower and upper bounds of the given dataset. The population is the collection of solutions and is represented as given in Eq. 12:

$$\begin{aligned} P = [S_1, S_2, \ldots , S_P]^T \end{aligned}$$
(12)

where P is the population. After the random initialization of the population, each candidate solution is evaluated using Eq. 2. The best objective value and its corresponding position vector are stored for future use.
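A minimal sketch of this initialization, assuming each solution is stored as a flattened vector of K centroids and the bounds are taken per feature from the dataset; the names are illustrative.

```python
import numpy as np

def init_population(P, K, f, lb, ub, rng=np.random.default_rng()):
    """Random population of P candidate solutions (Eqs. 11 and 12).

    lb, ub : per-feature lower and upper bounds of the dataset (length f).
    Each row encodes K centroids with f features, flattened to length K*f.
    """
    lb = np.tile(np.asarray(lb, dtype=float), K)   # repeat the bounds for each centroid
    ub = np.tile(np.asarray(ub, dtype=float), K)
    return lb + rng.random((P, K * f)) * (ub - lb)
```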

The use of chaotic sequences in optimization algorithms reduces the chance of stagnation owing to their sensitivity to initial conditions, stochasticity, and ergodicity [34, 35]. To avoid local entrapment in HHO, which might occur due to the large search domain and the nonlinear objective function, we employ chaotic sequences generated by the logistic chaotic map. The logistic map [36] is defined as follows:

$$\begin{aligned} x_{t+1}=cx_t(1-x_t) \end{aligned}$$
(13)

where \(c = 4\). The initial value of \(x_t\) is chosen randomly in the range 0.65 to 0.85. The reason behind the inclusion of chaotic sequences is to improve the global search capability of HHO. Note that \(r_2\) and \(r_3\) used in Eq. 4 are replaced by the chaotic sequence given in Eq. 13. Likewise, in the exploitation phase, the random vector S used in Eqs. 8 and 10 is replaced with chaotic sequences created by the logistic map of Eq. 13.
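A small sketch of the chaotic sequence generator and where its values are substituted; the function name is illustrative.

```python
def logistic_map(x0, length, c=4.0):
    """Chaotic sequence from the logistic map (Eq. 13); x0 is drawn from (0.65, 0.85)."""
    seq, x = [], x0
    for _ in range(length):
        x = c * x * (1.0 - x)
        seq.append(x)
    return seq

# In CHHO, successive values of this sequence replace r2 and r3 in Eq. 4
# (exploration) and the entries of the random vector S in Eqs. 8 and 10
# (exploitation with rapid dives).
```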

After performing the exploration and exploitation phases, the updated position vectors of the candidate solutions are obtained; they form the offspring population. The quality of the offspring solutions is compared, based on their objective values, with that of their parent population, and the best P solutions are selected from the combined population of parents and offspring. This process continues until the termination condition is met. The steps involved in the proposed approach are given in Algorithm 1 and sketched in the code below.
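A high-level sketch of the CHHO loop under the assumptions above; it reuses the `clustering_fitness` and `init_population` helpers sketched earlier, and `chho_update` stands in for the chaotic exploration/exploitation rules (a hypothetical placeholder, not an API defined in the paper).

```python
import numpy as np

def chho(data, K, pop_size=100, max_iter=500, rng=np.random.default_rng()):
    """Sketch of chaotic HHO for clustering; returns the best centroids found."""
    N, f = data.shape
    lb, ub = data.min(axis=0), data.max(axis=0)
    pop = init_population(pop_size, K, f, lb, ub, rng)                      # Eq. 11
    fit = np.array([clustering_fitness(data, s.reshape(K, f)) for s in pop])
    best = pop[fit.argmin()].copy()
    for t in range(max_iter):
        # offspring produced by the chaotic exploration/exploitation rules
        off = np.array([chho_update(s, pop, best, t, max_iter, lb, ub) for s in pop])
        off_fit = np.array([clustering_fitness(data, s.reshape(K, f)) for s in off])
        # elitist survivor selection: keep the best pop_size of parents + offspring
        merged, merged_fit = np.vstack([pop, off]), np.concatenate([fit, off_fit])
        keep = np.argsort(merged_fit)[:pop_size]
        pop, fit = merged[keep], merged_fit[keep]
        best = pop[fit.argmin()].copy()
    return best.reshape(K, f), fit.min()
```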

4 Datasets and experimental setup

To compare the performance of the proposed approach, six recently developed algorithms, namely Harris hawks optimization (HHO) [32], gray wolf optimizer (GWO) [11], butterfly optimization algorithm (BOA) [37], multi-verse optimizer (MVO) [38], salp swarm algorithm (SSA) [39], and sine cosine algorithm (SCA) [40], have been considered. The parameters of HHO, GWO, BOA, MVO, SSA, and SCA are set according to their corresponding references [11, 32, 37,38,39,40], respectively. The common parameter settings, such as the population size, the maximum number of iterations, and the number of independent runs, are the same for all algorithms and are given below:

  • population size = 100

  • maximum number of iterations = 500

  • number of independent runs = 40.

In this study, eight shape datasets and four UCI datasets are considered for the performance evaluation of the algorithms. These datasets are downloaded from https://cs.joensuu.fi/sipu/datasets/. The descriptions of the shape and UCI datasets are given in Tables 2 and 3, respectively. All algorithms have been implemented in MATLAB R2017a on a machine with a 64-bit Windows 7 operating system, 4 GB of RAM, and a Core-i5 processor.

Algorithm 1 Steps of the proposed CHHO approach
Table 2 Shape datasets
Table 3 UCI datasets

5 Analysis of experimental results

This section analyzes the experimental results based on various parameters. The performance of the algorithms is compared based on the experimental values of the sum of intra-cluster distances. The best, worst, mean, and standard deviation of the sum of intra-cluster distances have been evaluated for each algorithm and dataset. These values are given in Tables 4 and 5 for the shape and UCI datasets, respectively. The tables show that the best values of the sum of intra-cluster distances obtained by the proposed approach are the smallest on most of the datasets considered in this study. This indicates that, under the given experimental setup, CHHO captures more appropriate points in the search domain than the rest of the algorithms. On the other hand, observing the worst values of the sum of intra-cluster distances on the different datasets, the values obtained by CHHO are either better than or comparable to those of the rest of the algorithms. The mean in Tables 4 and 5 is the average of the sum of intra-cluster distances. Careful observation shows a significant difference in the mean values of the algorithms on different datasets. The reason is that the datasets under consideration are complex, and the algorithms have difficulty converging toward the optimal points. Large values of standard deviation (SD) are also observed. The reason is that there is a significant difference between the mean value of the sum of the intra-cluster distances and the values obtained in the individual runs.

Table 4 Experimental results of algorithms for shape datasets
Table 5 Experimental results of algorithms for UCI datasets

Tables 6 and 7 show the ranking of the algorithms based on the mean value of the sum of the intra-cluster distances for the shape and UCI datasets, respectively. In both cases, the average rank of CHHO is the best. To show the convergence rate of the algorithms graphically, convergence curves are drawn for the particular run in which the best value of the sum of intra-cluster distances is obtained. These convergence curves are shown in Figs. 1, 2, 3, 4, 5, and 6. CHHO shows efficacious behavior on most of the datasets.

Table 6 Ranking of algorithms for shape dataset
Table 7 Ranking of algorithms for UCI dataset
Fig. 1

Convergence curves of the algorithms for a Flame dataset, b Jain dataset

Fig. 2

Convergence curves of the algorithms for a R15 dataset, b D31 dataset

Fig. 3

Convergence curves of the algorithms for a Aggregation dataset, b Compound dataset

Fig. 4

Convergence curves of the algorithms for a Path-based dataset, b Spiral dataset

Fig. 5

Convergence curves of the algorithms for a Glass dataset, b Iris dataset

Fig. 6

Convergence curves of the algorithms for a Wine dataset, b Yeast dataset

In order to determine whether there are significant differences among the experimental results, statistical tests have been performed. We employed the Friedman test and the Iman–Davenport test to determine whether there are significant differences in the experimental results of the algorithms considered in this study. In general, the rejection of the null hypothesis in the Friedman and Iman–Davenport tests indicates that there are statistically significant differences in the performances of the algorithms. In such a case, a post hoc test is desirable, which compares the control algorithm (the best performing algorithm) against the remaining ones. Here, in all cases, we have used \(\alpha = 0.05\) as the level of confidence. Table 8 shows the statistical value and p value of the Friedman and Iman–Davenport tests for the shape datasets; both tests reject the null hypothesis. This confirms that there are statistically significant differences in the performances of the algorithms. Now, to check whether there is a significant difference in performance between CHHO (the best performing algorithm) and the rest of the algorithms, we have performed the Holm test as a post hoc test.
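A minimal sketch of this test sequence in Python; the score matrix and pairwise p values below are placeholders (not the paper's results), and the Iman–Davenport statistic is computed from the Friedman statistic with the usual correction formula.

```python
import numpy as np
from scipy.stats import friedmanchisquare
from statsmodels.stats.multitest import multipletests

# mean intra-cluster distance of each algorithm on each dataset
# (placeholder values: 8 shape datasets x 7 algorithms)
scores = np.random.rand(8, 7)

# Friedman test across the 7 algorithms
chi2, p = friedmanchisquare(*[scores[:, j] for j in range(scores.shape[1])])

# Iman-Davenport correction of the Friedman statistic
n, k = scores.shape
f_id = (n - 1) * chi2 / (n * (k - 1) - chi2)

# Holm post hoc correction of the pairwise p values of the control algorithm
# (CHHO) against the remaining six algorithms; p_raw is a placeholder
p_raw = np.random.rand(6)
reject, p_holm, _, _ = multipletests(p_raw, alpha=0.05, method="holm")
```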

Table 8 Results of Friedman’s and Iman–Davenport’s tests for shape datasets based on the sum of intra-cluster distances

Table 9 shows the experimental values of the Holm test for the shape datasets. The results of the Holm test indicate that CHHO is statistically better than MVO, SSA, BOA, and SCA regarding the sum of the intra-cluster distances. CHHO is not statistically better than GWO and HHO based on the Holm test results. However, the results reported in Table 4 show that CHHO outperforms HHO on six out of eight datasets. On the other hand, CHHO outperforms GWO on all shape datasets. Table 10 shows the results of the Friedman and Iman–Davenport tests for the UCI datasets. Both tests reject the null hypothesis and confirm the existence of significant differences in the performances of the algorithms. Further, the Holm test is performed for the UCI datasets, and the results are shown in Table 11. This table indicates that CHHO is statistically better than MVO, SSA, and SCA in terms of the sum of the intra-cluster distances. CHHO is not statistically better than BOA, GWO, and HHO based on the Holm test results. However, the rankings in Table 7 justify the efficacy of CHHO on all UCI datasets.

Table 9 Results of the Holm’s test for shape datasets based on the sum of intra-cluster distances (CHHO is the control algorithm)
Table 10 Results of Friedman’s and Iman–Davenport’s tests for UCI datasets based on the sum of intra-cluster distances
Table 11 Results of the Holm’s test for UCI datasets based on the sum of intra-cluster distances (CHHO is the control algorithm)

Tables 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, and 23 show the best centroids obtained by CHHO for the datasets used in this paper. The reason for presenting the best centroids is to validate the sums of intra-cluster distances reported in Tables 4 and 5. In this study, the maximum number of iterations was set to 500, and each algorithm was run 40 times to calculate the best, worst, mean, and standard deviation values. At the end of the program execution, 40 best values of the sum of the intra-cluster distances were obtained for each algorithm and each dataset. To compare these values, we have plotted a box plot for each dataset, shown in Figs. 7, 8, 9, 10, 11, and 12. On each box, the middle horizontal line indicates the median of the best values of the sum of the intra-cluster distances. These figures show that the best values of the sum of intra-cluster distances obtained by CHHO are better than those of the rest of the algorithms.
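A short matplotlib sketch of how such a box plot can be produced; the 40 values per algorithm below are random placeholders, not the paper's results.

```python
import numpy as np
import matplotlib.pyplot as plt

# best fitness of the 40 independent runs of each algorithm on one dataset
# (random placeholders; the paper's values underlie Tables 4 and 5)
algorithms = ["CHHO", "HHO", "GWO", "BOA", "MVO", "SSA", "SCA"]
runs = [np.random.rand(40) * 100 for _ in algorithms]

plt.boxplot(runs, labels=algorithms)      # middle line of each box is the median
plt.ylabel("Sum of intra-cluster distances")
plt.title("Distribution of the best values over 40 runs")
plt.show()
```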

Table 12 The best centroids obtained by CHHO algorithm for Flame dataset
Table 13 The best centroids obtained by CHHO algorithm for Jain dataset
Table 14 The best centroids obtained by CHHO algorithm for R15 dataset
Table 15 The best centroids obtained by CHHO algorithm for D31 dataset
Table 16 The best centroids obtained by CHHO algorithm for Aggregation dataset
Table 17 The best centroids obtained by CHHO algorithm for Compound dataset
Table 18 The best centroids obtained by CHHO algorithm for Path-based dataset
Table 19 The best centroids obtained by CHHO algorithm for Spiral dataset
Table 20 The best centroids obtained by CHHO algorithm for Glass dataset
Table 21 The best centroids obtained by CHHO algorithm for Iris dataset
Table 22 The best centroids obtained by CHHO algorithm for Wine dataset
Table 23 The best centroids obtained by CHHO algorithm for Yeast dataset
Fig. 7

Box plot of the algorithms for a Flame dataset, b Jain dataset

Fig. 8

Box plot of the algorithms for a R15 dataset, b D31 dataset

Fig. 9

Box plot of the algorithms for a Aggregation dataset, b Compound dataset

Fig. 10

Box plot of the algorithms for a Path-based dataset, b Spiral dataset

Fig. 11

Box plot of the algorithms for a Glass dataset, b Iris dataset

Fig. 12

Box plot of the algorithms for a Wine dataset, b Yeast dataset

The detailed discussion of the experimental results has justified the efficacy of the suggested approach in many ways. HHO had already proved its competitive behavior in solving global optimization problems. The inclusion of chaotic sequences makes HHO more powerful by reducing the possibility of local entrapment while solving the data clustering problem; the replacement of random numbers by chaotic numbers improves the global search ability of CHHO. On the other hand, MVO shows the worst performance on all datasets used, which points to local entrapment during the optimization process. The performance of the proposed approach is quite satisfactory in solving the data clustering problem. However, it decreases as the dimension of the problem increases, and the proposed algorithm has difficulty converging toward the optimal points when the search domain is very large. In such cases, the HHO algorithm in conjunction with a suitable distribution (such as Cauchy or Gaussian) could be more effective in achieving the desired goal. In addition, HHO in conjunction with an opposition-based learning approach could also be effective in solving optimization problems when the dimension and the search domain are very large.

6 Conclusions and future research direction

In this paper, a chaotic sequence-guided HHO algorithm is suggested for data clustering for the first time. The objective is to find the optimal position vectors of the centroids that best represent the clusters. The performance of the proposed approach is compared against six recent state-of-the-art algorithms on 12 benchmark datasets. The suggested approach outperformed the other algorithms in most cases. In the future, we would like to apply the proposed approach to other real-world applications. A multiobjective version of the proposed approach can also be developed and applied to handle multiple conflicting objectives simultaneously.