1 Introduction

Data clustering is the process of dividing data into multiple clusters or groups based on similarity. Clustering has applications in many fields, including exploratory data analysis, image segmentation, and mathematical programming [1]. Many traditional clustering algorithms have been proposed, of which K-means is one of the most important. However, K-means is sensitive to initialization, which greatly affects the clustering results: when the initial cluster centers are of poor quality, the algorithm easily falls into a local optimum and may fail to produce an effective clustering [2].

To improve the quality of traditional clustering methods, researchers have started to study the combination of swarm intelligence and clustering algorithms. Swarm intelligence is a computational technique based on the behavioral rules of biological groups. It is inspired by social insects and swarming vertebrates; the resulting algorithms are characterized by autonomy and robustness and are used to solve distributed problems. Various swarm intelligence algorithms have been proposed and applied to data clustering tasks. For example, cohesive hierarchical clustering has been introduced into the brainstorming algorithm [3], particle swarm optimization (PSO) has been efficiently mixed with fuzzy clustering [4], the fireworks algorithm has been combined with a hard clustering technique [5], and the brainstorming algorithm has been combined with K-means for numerical optimization [6].

The bacterial foraging optimization algorithm (BFO), a relatively new member of the swarm intelligence family, mimics the foraging behaviors of bacteria. It has received much attention recently and has been combined with clustering techniques to solve practical problems [7, 8]. To further explore the potential of BFO in solving data clustering problems, this paper proposes an evolutionary factor-driven concise bacterial foraging optimization algorithm (EFCBFO) and combines it with K-means (EFCBFOK) to solve customer data clustering problems. Because the original BFO has high computational complexity, a concise BFO (CBFO) [9] with a simplified structure is employed. The main improvements of this paper are as follows. (1) Based on the evolutionary factor [10], a modified step size updating strategy is proposed in which the step size changes with the evolutionary state, which can better balance exploration and exploitation. (2) To guide the bacteria toward the global optimum and help them escape local optima, an improved chemotaxis operation is designed that integrates delayed information. During each iteration, bacteria can select learning objects from multiple generations of personal historical best and global best individuals, expanding the search space and enhancing the population diversity. (3) Combining the EFCBFO with K-means, EFCBFOK is designed to handle customer data clustering problems. Comparative experiments verify that EFCBFOK performs better than its competitors in terms of solution quality, three validity indexes, and computing time.

The remaining parts of this paper are organized as follows. Section 2 briefly introduces the traditional BFO and the K-means algorithm; Sect. 3 presents and discusses the EFCBFOK in detail. Section 4 presents the experiments and analyses. Section 5 concludes the whole paper and provides an outlook for future work.

2 Background

2.1 Bacterial Foraging Optimization Algorithm

The BFO is a stochastic search algorithm proposed by Passino in 2002 that mainly simulates the food searching behaviors of E. coli in the human intestine [11]. In this paper, three operations of BFO, chemotaxis, reproduction, and elimination-dispersal, are included [12] and described in detail.

Chemotaxis is the essential operation in the BFO and includes two actions: swimming and tumbling. In this stage, the bacterial swarm moves toward nutrient-rich places or away from nutrient-poor places through these two actions. The chemotaxis operation is shown in Eq. (1),

$${\theta }^{i}(j+1,k,l)={\theta }^{i}(j,k,l)+C(i)\phi (i)$$
(1)

where \({\theta }^{i}(j,k,l)\) represents the position of bacterium \(i\) during the \(jth\) chemotaxis, \(kth\) reproduction, and \(lth\) elimination-dispersal operations. \(C(i)\) is the step size taken during the chemotaxis process and \(\phi (i)\) is a unit-length vector in a random direction.
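As a minimal sketch of Eq. (1), the following Python fragment moves one bacterium by a step of length \(C(i)\) in a random unit direction. The function names and the way the random direction is sampled (uniform components in [-1, 1], then normalized) are assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def random_unit_direction(dim, rng):
    """Draw a random vector and normalize it to unit length (phi(i) in Eq. (1))."""
    delta = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in delta))
    if norm == 0.0:  # practically never happens; fall back to a fixed axis
        return [1.0] + [0.0] * (dim - 1)
    return [v / norm for v in delta]


def chemotaxis_step(theta, step_size, rng):
    """Eq. (1): theta(j+1) = theta(j) + C(i) * phi(i)."""
    phi = random_unit_direction(len(theta), rng)
    return [t + step_size * p for t, p in zip(theta, phi)]


rng = random.Random(0)
old = [0.5, 0.5, 0.5]
new = chemotaxis_step(old, step_size=0.1, rng=rng)
# The displacement has length equal to the step size, whatever the direction.
moved = math.sqrt(sum((a - b) ** 2 for a, b in zip(new, old)))
```

Because \(\phi (i)\) has unit length, the distance travelled in one step equals \(C(i)\), which is why the step size directly controls how coarse or fine the search is.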

The chemotaxis operation is followed by a reproduction operation. At this stage, the bacteria in poor health conditions are deleted, and bacteria in good health split into two bacteria at their current position.

For each elimination-dispersal operation, a fixed probability is used to determine whether a bacterium performs this operation; if the operation is performed, the current bacterium will die, and then a new bacterium is randomly generated in the solution space.
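The reproduction and elimination-dispersal operations described above can be sketched as follows. This is an illustrative sketch under stated assumptions: health is represented by a cost value where lower is healthier, exactly half of the population is replaced during reproduction, and new bacteria are sampled uniformly within hypothetical bounds `lower` and `upper`. None of these details are specified in the text.

```python
import random


def reproduce(positions, health):
    """Keep the healthier half (lower cost assumed healthier) and split each survivor in two."""
    order = sorted(range(len(positions)), key=lambda i: health[i])
    survivors = [positions[i] for i in order[: len(positions) // 2]]
    # Each survivor yields two bacteria located at the same position.
    return [list(p) for p in survivors] + [list(p) for p in survivors]


def eliminate_disperse(positions, p_ed, lower, upper, rng):
    """With fixed probability p_ed, replace a bacterium by a random new one."""
    dim = len(positions[0])
    result = []
    for p in positions:
        if rng.random() < p_ed:
            result.append([rng.uniform(lower, upper) for _ in range(dim)])
        else:
            result.append(list(p))
    return result


swarm = [[0.0], [1.0], [2.0], [3.0]]
costs = [4.0, 1.0, 3.0, 2.0]
after_reproduction = reproduce(swarm, costs)
after_dispersal = eliminate_disperse(after_reproduction, 0.25, 0.0, 3.0, random.Random(0))
```

Reproduction keeps the population size constant while concentrating bacteria at good positions; elimination-dispersal reintroduces diversity by relocating some of them at random.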

2.2 K-means Algorithm

K-means is a common and simple clustering technique [13]. It randomly initializes a set of \(k\) cluster centers, then proceeds by alternating between two steps: assignment and update [14]. Let \(X=\left\{{x}_{1},{x}_{2},\dots ,{x}_{n}\right\}\) be a dataset with \(n\) data points \({x}_{i}\) \((i=1,2,\dots ,n)\), and let \(M=\left\{{m}_{1},{m}_{2},\dots ,{m}_{k}\right\}\) be the set of cluster centers, indexed by \(p\) and \(q\) with \(1\le p,q\le k\). \({S}_{p}^{(t)}\) is the set of data points belonging to the \(pth\) cluster at the \(tth\) generation. The assignment and update steps are presented as follows.

Assignment Step: compute the Euclidean distances between the data points and the cluster centers, and assign each data point \({x}_{i}\) to the cluster with the smallest squared Euclidean distance, as presented in Eq. (2),

$$ S_{p}^{(t)} = \left\{ {x_{i} :\left\| {x_{i} - m_{p}^{(t)} } \right\|^{2} \le \left\| {x_{i} - m_{q}^{(t)} } \right\|^{2} ,\forall q,1 \le q \le k} \right\} $$
(2)

where \({m}_{p}^{(t)}\) and \({m}_{q}^{(t)}\) denote the \(pth\) and \(qth\) cluster centers at the \(tth\) generation, respectively.

Update Step: recalculate the average values of the data points assigned to each cluster,

$${m}_{p}^{(t+1)}=\frac{1}{\left|{S}_{p}^{(t)}\right|}\sum_{{x}_{i}\in {S}_{p}^{(t)}} {x}_{i}$$
(3)

where \(\left|{S}_{p}^{(t)}\right|\) is the number of data points belonging to the \(pth\) cluster at the \(tth\) generation. \({m}_{p}^{(t+1)}\) is the \(pth\) cluster center at the \((t+1)th\) generation.

Usually, the objective of the K-means algorithm is to minimize the sum of squared errors (SSE), which is presented in Eq. (4),

$$SS{E}^{(t)}=\sum_{p=1}^{k} \sum_{{x}_{i}\in {S}_{p}^{(t)}} D{\left({x}_{i},{m}_{p}^{(t)}\right)}^{2}$$
(4)

where \(D\left(\cdot ,\cdot \right)\) is the Euclidean distance and \(SS{E}^{(t)}\) is the SSE at the \(tth\) generation.
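The assignment step, update step, and SSE objective of Eqs. (2)-(4) can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in the experiments; the fixed iteration count, the seeding, and the choice to keep the old center when a cluster becomes empty are assumptions.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Eq. (2): index of the nearest center for every data point."""
    return [min(range(len(centers)), key=lambda p: squared_distance(x, centers[p]))
            for x in points]


def update(points, labels, centers):
    """Eq. (3): mean of the points in each cluster (empty clusters keep their center)."""
    new_centers = []
    for p, old in enumerate(centers):
        members = [x for x, lab in zip(points, labels) if lab == p]
        if not members:
            new_centers.append(list(old))
            continue
        new_centers.append([sum(m[d] for m in members) / len(members)
                            for d in range(len(old))])
    return new_centers


def sse(points, labels, centers):
    """Eq. (4): sum of squared distances to the assigned centers."""
    return sum(squared_distance(x, centers[lab]) for x, lab in zip(points, labels))


def kmeans(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centers = [list(x) for x in rng.sample(points, k)]  # random initialization
    for _ in range(iterations):
        labels = assign(points, centers)
        centers = update(points, labels, centers)
    labels = assign(points, centers)
    return centers, labels, sse(points, labels, centers)


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centers, labels, error = kmeans(points, k=2)
```

On this toy dataset the two well-separated pairs end up in different clusters. On harder data the result depends on the random initial centers, which is the sensitivity that motivates combining K-means with a swarm-based search.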

3 The Proposed Algorithm

To improve the performance of traditional BFO, this paper proposes an evolutionary factor-driven concise bacterial foraging optimization algorithm (EFCBFO). Then, the EFCBFO is combined with K-means (EFCBFOK) to solve customer data clustering tasks. In the EFCBFOK, an evolutionary factor-driven step size and an evolutionary factor-driven chemotaxis operation are designed based on the evolutionary factor proposed in [10]. The details of EFCBFOK are described as follows.

3.1 Evolutionary Factors

The evolutionary factor (\({E}_{f}\)) [10] is an indicator used to identify whether the population is in an exploration or an exploitation state. During the evolution process, the population distribution characteristics change not only with the number of iterations but also according to the \({E}_{f}\) [10]. In [10], the \({E}_{f}\) is estimated from the average distance between individuals. Concretely, at the beginning of the iterations, when the population is more dispersed, the average distance between individuals is relatively large; this is the exploration stage. When the individuals reach a local or global optimal region, the average distance between individuals is relatively small; this is the exploitation stage.

Based on this concept, the \({E}_{f}\) is calculated as follows. The first step is to calculate the average distance between the \(ith\) individual and the other individuals in the population by using the Euclidean distance. The equation is as follows.

$${d}_{i}=\frac{1}{S-1}\sum_{j=1,j\ne i}^{S} \sqrt{\sum_{d=1}^{D} {\left({\theta }^{id}-{\theta }^{jd}\right)}^{2}}$$
(5)

where \({d}_{i}\) is the average distance of the \(ith\) individual. \(S\) and \(D\) are the population size and the number of dimensions, respectively. \({\theta }^{id}\) and \({\theta }^{jd}\) are the positions of the \(ith\) and \(jth\) individuals in the \(dth\) dimension.

Based on the average distances of all the individuals, three important distances, \({d}_{g}\), \({d}_{min}\), and \({d}_{max}\), are defined. Specifically, \({d}_{g}\) is the average distance of the global best individual. \({d}_{min}\) and \({d}_{max}\) are the minimal and maximal average distances in all the average distances, respectively. After getting these distances, the \({E}_{f}\) is calculated as,

$${E}_{f}=\frac{{d}_{g}-{d}_{min}}{{d}_{max}-{d}_{min}}$$
(6)

It can be seen that the \({E}_{f}\) lies in the range [0,1]. It is relatively small when the global best individual is close to the other bacteria (\({d}_{g}\) near \({d}_{min}\)) and relatively large when the global best individual is far from the other bacteria (\({d}_{g}\) near \({d}_{max}\)).
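Eqs. (5) and (6) can be sketched as below. The helper names and the handling of the degenerate case \({d}_{max}={d}_{min}\) (returning 0) are assumptions for illustration; the text does not say how that case is treated.

```python
import math


def mean_distances(positions):
    """Eq. (5): average Euclidean distance from each individual to all others."""
    s = len(positions)
    return [
        sum(math.dist(positions[i], positions[j]) for j in range(s) if j != i) / (s - 1)
        for i in range(s)
    ]


def evolutionary_factor(positions, best_index):
    """Eq. (6): E_f = (d_g - d_min) / (d_max - d_min)."""
    d = mean_distances(positions)
    d_min, d_max = min(d), max(d)
    if d_max == d_min:  # all average distances equal; returning 0.0 is an assumption
        return 0.0
    return (d[best_index] - d_min) / (d_max - d_min)


# Four individuals on a line; the last one is an outlier.
swarm = [[0.0], [1.0], [2.0], [10.0]]
ef_central = evolutionary_factor(swarm, best_index=1)   # best sits inside the swarm
ef_outlier = evolutionary_factor(swarm, best_index=3)   # best is far from the swarm
```

In this example the average distances are 13/3, 11/3, 11/3, and 9, so the factor is 0 when the global best is the central individual and 1 when it is the outlier.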

According to the \({E}_{f}\), evolutionary states can be obtained [10, 15]. In [15], four types of evolutionary states are defined: the exploration state, exploitation state, convergence state, and jump-out state. These states, denoted \(\xi (k)\), are obtained by dividing the range of \({E}_{f}\) into four equal intervals, as presented in Eq. (7),

$$\xi (k)=\left\{\begin{array}{ll}1,& 0\le {E}_{f}<0.25\\ 2,& 0.25\le {E}_{f}<0.5\\ 3,& 0.5\le {E}_{f}<0.75\\ 4,& 0.75\le {E}_{f}\le 1\end{array}\right.$$
(7)

When \(\xi (k)\) is equal to 1, 2, 3, and 4, the population is in the convergence, exploitation, exploration, and jump-out state, respectively.
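The mapping of Eq. (7) from the evolutionary factor to a state index can be written as a small function. The function name and the `STATE_NAMES` dictionary are illustrative choices, not part of the paper.

```python
STATE_NAMES = {1: "convergence", 2: "exploitation", 3: "exploration", 4: "jump-out"}


def evolutionary_state(e_f):
    """Eq. (7): map E_f in [0, 1] to a state index in {1, 2, 3, 4}."""
    if not 0.0 <= e_f <= 1.0:
        raise ValueError("E_f must lie in [0, 1]")
    if e_f < 0.25:
        return 1
    if e_f < 0.5:
        return 2
    if e_f < 0.75:
        return 3
    return 4


state = evolutionary_state(0.6)
name = STATE_NAMES[state]
```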

3.2 Evolutionary Factors-Driven Step Size

In the original BFO, the step size \(C(i)\) is the length of each step during the swimming action and is a constant. However, if \(C(i)\) is too small, the bacteria focus on local search/exploitation, and it may take a long time to find the optimal value; if \(C(i)\) is too large, the bacteria focus on global search/exploration, and the optimal value may be missed. It can be observed that the \({E}_{f}\) behaves in the way a good step size should, i.e., \({E}_{f}\) is relatively large in the exploration and jump-out states and relatively small in the convergence state [10]. Therefore, \(C(i)\) can be defined based on the \({E}_{f}\), as presented in the following equation,

$$C(i)=\left({C}_{max}-{C}_{min}\right){E}_{f}+{C}_{min}$$
(8)

where \({C}_{max}\) and \({C}_{min}\) are the maximal and minimal step sizes, respectively. This paper sets \({C}_{max}\) as 0.1 and \({C}_{min}\) as 0.01. The step size varies with the \({E}_{f}\), and a larger \(C(i)\) will be more favorable for global search in the jump-out and exploration states; the smaller \(C(i)\) in the convergence state favors the local search.
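Eq. (8) is a linear interpolation between \({C}_{min}\) and \({C}_{max}\). The short sketch below uses the values stated above (0.1 and 0.01); the function and constant names are illustrative.

```python
C_MAX = 0.1   # maximal step size used in the paper
C_MIN = 0.01  # minimal step size used in the paper


def step_size(e_f, c_max=C_MAX, c_min=C_MIN):
    """Eq. (8): C(i) = (C_max - C_min) * E_f + C_min."""
    return (c_max - c_min) * e_f + c_min


smallest = step_size(0.0)  # E_f = 0 gives C_min
middle = step_size(0.5)    # E_f = 0.5 gives the midpoint of the two bounds
largest = step_size(1.0)   # E_f = 1 gives C_max
```

With these bounds the step size ranges from 0.01 (convergence, \({E}_{f}=0\)) to 0.1 (jump-out, \({E}_{f}=1\)).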

3.3 Evolutionary Factors-Driven Chemotaxis Operation

To make better use of the historical information, delayed information of bacterial swarm is used to guide the bacteria to move to the optimal directions. Concretely, two indicators denoted \({\varepsilon }_{i}(k)\) and \({\varepsilon }_{g}(k)\) are employed. Among them, \(k\) is the information delay interval, which implies that the personal historical best and global best of recent \(k\) generations should be recorded and used. \({\varepsilon }_{i}(k)\) and \({\varepsilon }_{g}(k)\) are two uniformly generated integers in the range of [1, \(k\)]. \(i\) and \(g\) represent the indexes of personal best and global best, respectively.

Additionally, another two indicators, denoted \({s}_{i}(k)\) and \({s}_{g}(k)\), are used in this paper. Depending on the evolutionary state, the values of \({s}_{i}(k)\) and \({s}_{g}(k)\) are as shown in Table 1. In the convergence state, the bacteria are expected to reach the region near the global optimum, so \({s}_{i}(k)\) and \({s}_{g}(k)\) are both set to 0. In the exploitation state, as much local information as possible needs to be used, so \({s}_{i}(k)\) is set to \({E}_{f}(k)\). In the exploration state, more global information needs to be used, so \({s}_{g}(k)\) is set to \({E}_{f}(k)\). In the jump-out state, the bacteria need to escape from the region near a local optimum, so both \({s}_{i}(k)\) and \({s}_{g}(k)\) are set to \({E}_{f}(k)\) to provide more information for jumping out of the local optimum.

Table 1. Values of indicators.

Based on the aforementioned analysis, an improved chemotaxis operation is designed, which is shown as follows,

$$\begin{array}{c}{\theta }^{i}(j+1,k,l)={\theta }^{i}(j,k,l)+C(i)\phi (i)\\ +{s}_{i}(k)C(i){r}_{1}\left({p}_{i}\left(k-{\varepsilon }_{i}(k)\right)-{\theta }^{i}(j,k,l)\right)\\ +{s}_{g}(k)C(i){r}_{2}\left({p}_{g}\left(k-{\varepsilon }_{g}(k)\right)-{\theta }^{i}(j,k,l)\right)\end{array}$$
(9)

where \({r}_{1}\) and \({r}_{2}\) are uniformly generated random numbers in [0,1], and \({p}_{i}\left(k-{\varepsilon }_{i}(k)\right)\) and \({p}_{g}\left(k-{\varepsilon }_{g}(k)\right)\) are the selected personal historical best and global best individuals, respectively. It can be seen that the designed chemotaxis operation includes four parts. The first and second parts are the same as in the original BFO. The third and fourth parts are the self-learning and global learning parts with delayed information. Based on the evolutionary states, the bacteria can learn from different individuals.
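The following sketch shows one possible reading of Eq. (9) together with the indicator values described in the text. Several details are assumptions: indicator entries that the text does not state explicitly are set to 0, \({r}_{1}\) and \({r}_{2}\) are drawn once per bacterium rather than per dimension, and the delayed bests are stored in lists with the most recent generation last, so that `history[-eps]` is the best from `eps` generations back.

```python
import random


def indicators(state, e_f):
    """Return (s_i, s_g) for a state index 1..4; unstated entries are assumed to be 0."""
    if state == 1:       # convergence: no learning terms
        return 0.0, 0.0
    if state == 2:       # exploitation: use personal (local) information
        return e_f, 0.0
    if state == 3:       # exploration: use global information
        return 0.0, e_f
    return e_f, e_f      # jump-out: use both


def improved_chemotaxis(theta, c, phi, s_i, s_g, pbest_history, gbest_history, rng):
    """Eq. (9) for one bacterium, with delayed personal and global bests."""
    eps_i = rng.randint(1, len(pbest_history))  # delay for the personal best
    eps_g = rng.randint(1, len(gbest_history))  # delay for the global best
    p_i = pbest_history[-eps_i]
    p_g = gbest_history[-eps_g]
    r1, r2 = rng.random(), rng.random()
    return [
        t + c * f + s_i * c * r1 * (a - t) + s_g * c * r2 * (b - t)
        for t, f, a, b in zip(theta, phi, p_i, p_g)
    ]


rng = random.Random(1)
s_i, s_g = indicators(1, 0.1)  # convergence state
moved = improved_chemotaxis([0.0, 0.0], 0.1, [1.0, 0.0], s_i, s_g,
                            [[1.0, 1.0]], [[2.0, 2.0]], rng)
```

In the convergence state both indicators are zero, so the update reduces to the original chemotaxis step of Eq. (1); in the other states the bacterium is additionally pulled toward a randomly chosen delayed personal best, global best, or both.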

3.4 The Framework of EFCBFOK

Combining EFCBFO and K-means, EFCBFOK is designed to handle customer clustering tasks. In EFCBFOK, SSE is the objective function. The framework of EFCBFOK is described as follows (Fig. 1).

  • Step 1. Initialize the position of the population and the parameters of the algorithm.

  • Step 2. Evaluate the fitness values of the population, and store their personal historical best and global best.

  • Step 3. Iteration loop.

  • Step 3.1. Obtain the evolutionary factor according to Eq. (6), determine the evolutionary state according to Eq. (7), and obtain the indicator values from Table 1.

  • Step 3.2. Update the positions of the population by implementing evolutionary factors-driven chemotaxis operation.

  • Step 3.3. If the iteration number is a multiple of reproduction frequency (\({F}_{re}\)), implement the reproduction operation.

  • Step 3.4. If the iteration number is a multiple of elimination-dispersal frequency (\({F}_{ed}\)), implement the elimination-dispersal operation.

  • Step 4. Repeat Step 3 until the termination condition is met.
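The steps above can be sketched as a compact Python loop. This is an illustrative skeleton under several assumptions and not the authors' implementation: each bacterium encodes \(k\) cluster centers as one flat vector, fitness is the SSE with nearest-center assignment, health is the current cost, bacteria keep their personal bests after dispersal, and the parameters `p_ed` and `delay` as well as all function names are hypothetical.

```python
import math
import random


def sse_fitness(position, points, k):
    """Decode a flat position into k centers and return the SSE of Eq. (4)."""
    dim = len(points[0])
    centers = [position[c * dim:(c + 1) * dim] for c in range(k)]
    return sum(min(sum((x - m) ** 2 for x, m in zip(p, c)) for c in centers)
               for p in points)


def evolutionary_factor(swarm, g):
    """Eqs. (5)-(6); returns 0.0 for a degenerate swarm (assumption)."""
    s = len(swarm)
    d = [sum(math.dist(swarm[i], swarm[j]) for j in range(s) if j != i) / (s - 1)
         for i in range(s)]
    return 0.0 if max(d) == min(d) else (d[g] - min(d)) / (max(d) - min(d))


def efcbfok_sketch(points, k, pop=10, iters=30, f_re=5, f_ed=2, p_ed=0.25,
                   delay=3, c_max=0.1, c_min=0.01, seed=0):
    rng = random.Random(seed)
    dim = k * len(points[0])
    lo, hi = min(map(min, points)), max(map(max, points))

    def new_position():
        return [rng.uniform(lo, hi) for _ in range(dim)]

    # Steps 1-2: initialize, evaluate, store personal and global bests.
    swarm = [new_position() for _ in range(pop)]
    cost = [sse_fitness(b, points, k) for b in swarm]
    pbest, pcost = [[list(b)] for b in swarm], list(cost)  # histories, newest last
    g = min(range(pop), key=lambda i: cost[i])
    gbest, gcost = [list(swarm[g])], cost[g]
    trace = [gcost]
    for it in range(1, iters + 1):  # Step 3
        # Step 3.1: evolutionary factor, state (Eq. 7), step size (Eq. 8), indicators.
        g = min(range(pop), key=lambda i: cost[i])
        e_f = evolutionary_factor(swarm, g)
        state = min(int(e_f * 4) + 1, 4)
        c = (c_max - c_min) * e_f + c_min
        s_i = e_f if state in (2, 4) else 0.0
        s_g = e_f if state in (3, 4) else 0.0
        # Step 3.2: chemotaxis (Eq. 9) with delayed personal and global bests.
        for i in range(pop):
            delta = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
            norm = math.sqrt(sum(v * v for v in delta)) or 1.0
            p_i, p_g = rng.choice(pbest[i]), rng.choice(gbest)
            r1, r2 = rng.random(), rng.random()
            swarm[i] = [t + c * v / norm + s_i * c * r1 * (a - t) + s_g * c * r2 * (b - t)
                        for t, v, a, b in zip(swarm[i], delta, p_i, p_g)]
            cost[i] = sse_fitness(swarm[i], points, k)
        # Step 3.3: reproduction; the better half overwrites the worse half.
        if it % f_re == 0:
            order = sorted(range(pop), key=lambda i: cost[i])
            for good, bad in zip(order[:pop // 2], order[pop - pop // 2:]):
                swarm[bad], cost[bad] = list(swarm[good]), cost[good]
                pbest[bad], pcost[bad] = [list(p) for p in pbest[good]], pcost[good]
        # Step 3.4: elimination-dispersal with probability p_ed.
        if it % f_ed == 0:
            for i in range(pop):
                if rng.random() < p_ed:
                    swarm[i] = new_position()
                    cost[i] = sse_fitness(swarm[i], points, k)
        # Record this generation's personal and global bests (last `delay` kept).
        for i in range(pop):
            if cost[i] < pcost[i]:
                pcost[i], newest = cost[i], list(swarm[i])
            else:
                newest = pbest[i][-1]
            pbest[i] = (pbest[i] + [newest])[-delay:]
        b = min(range(pop), key=lambda i: pcost[i])
        if pcost[b] < gcost:
            gcost, newest = pcost[b], pbest[b][-1]
        else:
            newest = gbest[-1]
        gbest = (gbest + [list(newest)])[-delay:]
        trace.append(gcost)
    return gbest[-1], gcost, trace


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
best_position, best_sse, trace = efcbfok_sketch(points, k=2)
```

The returned `trace` holds the best SSE found so far after each iteration, so it is non-increasing by construction and corresponds to the kind of convergence curve reported in the experiments.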

Fig. 1. The framework of EFCBFOK.

4 Experiments and Analyses

4.1 Datasets and Experimental Parameters

To demonstrate the superiority of the EFCBFOK, five datasets, Taiwan, German, Australian, Marketing, and Hotel, are selected as the testing datasets. The missing and invalid data are deleted before clustering. The description of the five testing datasets is shown in Table 2.

Table 2. The description of the testing datasets.

Additionally, three algorithms are selected as the competitors: K-means, a PSO-based clustering technique (PSOK) [16], and a CBFO-based clustering algorithm (CBFOK) [9]. The parameters of EFCBFOK, PSOK, and CBFOK are as follows. The population size is 100; the numbers of independent runs and iterations are 10 and 100, respectively. For EFCBFOK and CBFOK, the reproduction frequency is 5, and the elimination-dispersal frequency is 2. For PSOK, the acceleration coefficients \({C}_{1}\) and \({C}_{2}\) are both 2. These algorithms are coded using PyCharm Community Edition 2021. To evaluate the clustering quality of all the algorithms, inter-cluster distance, Silhouette [17], and F-measure [18] are selected as validity indexes.

4.2 Experimental Results and Analysis

Table 3 gives the average optimal solutions and the three validity indexes over 10 runs, together with the computation times of the four algorithms. Boldface with underline and boldface highlight the best and second-best values of the four algorithms on each metric, respectively. Figure 2 shows the SSE convergence curves of all algorithms on the five datasets. From Table 3 and Fig. 2, three observations can be made.

  • The EFCBFOK algorithm performs better than its competitors with regard to the three validity indexes on all five datasets, especially on the German and Marketing datasets. As for the F-measure, EFCBFOK obtains overwhelming advantages over its peers. This implies that the EFCBFOK algorithm effectively improves the clustering quality of customer datasets. Conversely, PSOK has the worst performance among the four algorithms, ranking second only on some datasets for a single validity index.

  • In terms of computing time, although K-means is the fastest, the EFCBFOK uses less time on the five datasets than the other swarm intelligence-based clustering algorithms. This implies that the proposed EFCBFOK has a faster convergence speed than CBFOK and PSOK.

  • From the convergence curves, it can be seen that the curve of EFCBFOK lies below those of the other algorithms. This means that the EFCBFOK outperforms the other three algorithms in terms of global optimality regardless of the dataset.

Table 3. The experimental results of EFCBFOK and its competitors on five datasets.
Fig. 2. SSE iterative curves of four algorithms on five datasets.

5 Conclusion

This paper proposes an evolutionary factor-driven concise bacterial foraging optimization algorithm combined with K-means (EFCBFOK) to solve the customer clustering problem. First, the concise BFO with a simplified structure is used to decrease the computing complexity of BFO. Then, a modified step size strategy is proposed according to the evolutionary factor. Additionally, driven by the evolutionary factor, an improved chemotaxis operation is proposed that lets the bacteria select learning individuals from multiple generations of personal historical best and global best individuals, which expands the search space and enhances the diversity. To validate its effectiveness, EFCBFOK is compared with three other algorithms on three validity indexes over five customer datasets. Experimental results demonstrate that EFCBFOK performs better than its competitors regarding solution quality, three validity indexes, and computing time.

In future work, EFCBFOK will be used to solve multi-objective data clustering tasks. Furthermore, more strategies should be designed to enhance the performance of BFO.