1 Introduction

Cluster analysis has been widely applied in many areas. Its goal is to partition a set of patterns into disjoint and homogeneous clusters [23]. Basically, a clustering algorithm groups highly similar data together and thus minimizes the similarity between clusters. It reveals how the data are distributed as well as their features. By reducing complexity in this way, a clustering algorithm also allows us to analyze the commercial implications hidden in the clustered data, which can benefit business decision making.

Many clustering methods have already been proposed in the literature. Among them, metaheuristics, which are heuristic and stochastic search procedures based on the mechanics of natural selection, genetics, and evolution, are the most widely applied, since they are designed to escape local optima in the search for a global solution [29]. Therefore, this study considers a kind of AIS, the artificial immune network (aiNet), for cluster analysis. The proposed clustering algorithm is an integration of aiNet and the K-means algorithm (aiNetK). To validate the proposed aiNetK algorithm, four benchmark data sets, Iris, Wine, Glass, and Breast Cancer, were employed, and the results are compared with those of AIS-based and PSO-based algorithms.

The remainder of this study is organized as follows. Section 2 briefly presents the necessary background, and Sect. 3 describes the proposed aiNetK algorithm. Section 4 discusses the experimental results on four benchmark data sets. Finally, concluding remarks are made in Sect. 5.

2 Background

This section briefly surveys the literature related to the current study, covering cluster analysis and artificial immune systems.

2.1 Cluster analysis

Cluster analysis, an important technique in data mining, partitions data into a certain number of clusters. A cluster is described in terms of internal homogeneity and external differentiation, i.e., patterns in the same cluster should be similar to each other, while patterns in different clusters should not [38]. Basically, clustering algorithms can be divided into two categories: hierarchical and partitional clustering [14]. Hierarchical clustering algorithms organize data into a hierarchical structure according to a proximity matrix, and their results are usually depicted by a binary tree or dendrogram. Hierarchical clustering algorithms can be further classified into agglomerative and divisive methods. Many hierarchical clustering algorithms have been presented, including CURE [10], BIRCH [42], ROCK [11], CHAMELEON [16], DIANA [17], AUTOCLUST [7], and AMOEBA [6].

Unlike hierarchical clustering, which produces successive levels of clusters by iterative fusions or divisions, partitional clustering assigns a set of objects to clusters with no hierarchical structure. These algorithms aim to minimize a certain criterion, such as a square error function, and can therefore be treated as optimization problems. Partitional clustering aims to optimize the cluster centroids as well as the number of clusters [13]. Partitional clustering algorithms are mainly classified into supervised and unsupervised algorithms; the main difference is that supervised clustering algorithms require the number of clusters to be pre-determined, while unsupervised ones do not. Recently, semi-supervised clustering algorithms have also been developed [43, 44]. Since the algorithm proposed in this study belongs to the supervised category, it is discussed in more detail.

The most widely used supervised clustering algorithm is the K-means algorithm [8]. Later, Bezdek [2] developed a fuzzy version of the K-means algorithm, called the fuzzy C-means algorithm; unlike K-means, an object can belong to all clusters with a certain degree of membership. Other modifications of the fuzzy C-means algorithm are the possibilistic c-means clustering algorithm [19], fuzzy c-shells [3], hierarchical unsupervised fuzzy clustering [9, 25], fuzzy possibilistic c-means [32, 37], and semi-supervised fuzzy clustering algorithms [28, 39].

In addition, metaheuristics have been employed to improve the performance of K-means. Clustering can be regarded as a category of optimization problems solvable by metaheuristic algorithms, such as GA, PSO, or ant colony optimization (ACO). Many techniques follow this approach, including the genetically guided algorithm [12], genetic K-means [18, 26], Tabu search [1], ant colony clustering [20], and the particle swarm K-means optimization (PSKO) algorithm [21].

2.2 Artificial immune network

In 1974, Jerne proposed the first mathematical model of the immune network, which initiated subsequent research and discussion. Liu et al. [24] described the artificial immune system (AIS) as a novel intelligent algorithmic method inspired by the biological immune system. The main principle of AIS is to imitate the biological immune system, from whose protection mechanisms AIS acquires its learning capability.

Opt-aiNet derives from aiNet, which uses clonal selection and affinity maturation from immune theory, together with immune network theory, to establish a network model; it is an important application model of the artificial immune system. Its advantages include noise tolerance, unsupervised learning, and self-organization, with applications in data processing, optimization, and fault diagnosis. Timmis and Edmonds [36] described the optimal artificial immune network model as an adaptive process based on immune principles for performing search and optimization. The procedures in the model include fitness evaluation of antibodies against the objective function, clonal expansion, antibody maturation, and suppression. Pasti and Castro [31] noted several characteristics of opt-aiNet. The population size grows or shrinks over the iterations. Cloning produces similar antibodies by copying members of the population, and the number of copies of each member is the same. The variation of cloned antibodies is related to the fitness value of the original members: a better fitness value implies smaller variation. Excessively similar antibodies, i.e., those whose affinity (distance) to another antibody is smaller than a preset threshold \(\sigma _s \), are suppressed. The average fitness value is evaluated to explore the search space; if it deviates from that of the last generation, exploration continues, and the suppression phase is performed once no deviation occurs. New cells are added after suppression to prevent the solution from falling into a local optimum.

Bezerra et al. [4] proposed the adaptive radius immune algorithm (ARIA), which retains most of the density information of the compressed data; simulation experiments showed that its clustering performance is better than that of the aiNet clustering algorithm. Tang and Vemuri [34] used the aiNet model in the more complex field of document clustering; based on the immune network and mutation principles, the aiNet algorithm produced good clustering results and outperformed hierarchical clustering and K-means. Sotiropoulos et al. [33] applied aiNet to cluster analysis of online shopping customers, comparing it with hierarchical, fuzzy c-means, and spectral analysis methods; the experiments demonstrated that aiNet yields better results. Hart and Ross [15] combined AIS with sparse distributed memories (SDM) to propose a model for processing large amounts of dynamic data; the model can track a dynamic data set and memorize past clustering results. Nasraoui et al. [30] showed that an AIS algorithm can be used for web mining of dynamic click-stream data, clustered continuously, as in student course grouping. Younsi and Wang [40] applied the clonal selection principle of the basic artificial immune system to data clustering, proposed a simpler algorithm, and used four simulated data sets and two reference data sets to test it; the experimental results indicated that the data sets were clustered correctly. Building on anti-spam technology, Yue et al. [41] proposed a new e-mail service system that uses an artificial immune clustering method to filter all messages. Li et al. [22] proposed a self-adaptive multiple-clone clustering algorithm based on the biological immune and clonal selection principles; simulation experiments showed that the proposed algorithm is a rational and effective clustering method and yields better results than K-means. Based on the immune control and immune selection mechanisms of the biological immune system, Ma et al. [27] proposed a new data mining method, the Immune Dominance Clonal Multi-objective Clustering (IDCMC) algorithm, in which the parent antibody population is divided into three sub-populations handled with different measurement methods and evolutionary selection strategies; simulation experiments showed that this algorithm has better clustering performance than GA and K-means. Chiu et al. [5] showed that a hybrid of an artificial immune system and an ant algorithm can be used for air-conditioner market segmentation in 3C shopping; the experiments proved that the hybrid method yields better clustering results than the SOM clustering method.

3 Methodology

This section will present the proposed method for cluster analysis. It consists of two main components: (1) data encoding and (2) aiNetK algorithm.

3.1 Data encoding

Each antibody represents a clustering solution consisting of cluster centroids \(Z_{i}\), where \(Z_{i}\) denotes the \(i\)th cluster centroid. Thus, an antibody with four cluster centroids can be represented as shown in Fig. 1.

Fig. 1 Real-coding method for aiNetK algorithm
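To make the real-coded scheme concrete, the following sketch (Python with NumPy; all names are ours, not from the original paper) builds one antibody holding four cluster centroids for two-dimensional data:

```python
import numpy as np

# One antibody = K cluster centroids Z_1..Z_K, each a d-dimensional vector.
# Here K = 4 clusters in d = 2 feature dimensions, mirroring Fig. 1.
K, d = 4, 2
antibody = np.array([[1.0, 2.0],   # Z_1
                     [4.0, 0.5],   # Z_2
                     [3.2, 3.3],   # Z_3
                     [0.1, 1.8]])  # Z_4
assert antibody.shape == (K, d)
```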

3.2 aiNetK algorithm

The proposed clustering algorithm is an integration of aiNet and K-means algorithm. Figure 2 illustrates the algorithm flowchart. The detailed procedure of aiNetK algorithm is as follows.

Fig. 2 aiNetK algorithm flowchart

Step 1: Set up parameters

In the aiNetK algorithm, it is necessary to determine several parameters: the number of generations, the number of memory cells \(M\), the number of remaining cells \(R\), the number of clones \(N_{c}\), the error threshold between two generations, and the self-suppression threshold \({\sigma }_s \).

Step 2: Generate initial population randomly

At the beginning, an initial population is generated randomly. The initial population consists of \(P\) antibodies comprising \(M\) memory cells and \(R\) remaining cells. The population \(P\) moves toward the optimal solution through the following steps. In the proposed aiNetK algorithm, the number of clusters, \(K\), must be pre-specified. Since every antibody consists of \(K\) centroid vectors, it can be represented as:

$$\begin{aligned} P_{id} =\left( {X_{i1} ,\ldots ,X_{ij} ,\ldots ,X_{iK} } \right) , \end{aligned}$$
(1)

where \(P_{id} \) represents the \(i\)th antibody, \(i\) is the antibody index, and \(d\) is the cluster index.
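As a hedged illustration of this initialization, the sketch below stores the population as a NumPy array of shape \((P, K, d)\) and draws centroids uniformly within the feature bounds of the data (the sampling scheme is our assumption; the paper only states that initialization is random):

```python
import numpy as np

def init_population(data, P, K, rng=np.random.default_rng(0)):
    """Randomly generate P antibodies, each a set of K centroid vectors."""
    lo, hi = data.min(axis=0), data.max(axis=0)  # feature-wise bounds
    # Uniform sampling inside the bounding box of the data set.
    return rng.uniform(lo, hi, size=(P, K, data.shape[1]))
```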

Step 3: Calculate fitness value for population

This study uses Euclidean distance to calculate the fitness value of each antibody. The procedure is as follows:

(i) Calculate the distance of each data point \(X_{i}\) to each cluster centroid \(C_{ij}\). The distance of \(X_{i}\) to \(C_{ij}\) is:

    $$\begin{aligned} d(X_i ,C_{ij} )=\left\| {X_i -C_{ij} } \right\| , \end{aligned}$$
    (2)

    where \(X_{i}\) is the \(i\)th data point vector and \(C_{ij}\) is the centroid of the \(j\)th cluster in the \(i\)th antibody.

(ii) Form \(K\) clusters by assigning each data point to its closest centroid for every antibody.

(iii) Calculate the fitness value \(\text{Ab}_i\) for every antibody, represented as:

    $$\begin{aligned} \text{Ab}_i&= \frac{1}{1+D_i } ,\quad 0\le \text{Ab}_i \le 1,\quad {i}\in {P};\end{aligned}$$
    (3)
    $$\begin{aligned} D_i&= \text{SED}_i =\sum _{j=1}^K \sum _{\forall X_i \in n_{ij} } \left\| {X_i -C_{ij} } \right\| , \end{aligned}$$
    (4)

    where \(D_i \) is the sum of Euclidean distances (SED) of the \(i\)th antibody between \(X_{i}\) and \(C_{ij}\), and \(n_{ij}\) denotes the set of data points assigned to the \(j\)th cluster of the \(i\)th antibody.
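Sub-steps (i)-(iii) can be condensed into a few vectorized lines (a sketch under the same array conventions as the earlier snippets; not the authors' code):

```python
import numpy as np

def fitness(population, data):
    """Return Ab_i = 1/(1 + SED_i) for every antibody (Eqs. 2-4)."""
    # dists[i, n, j] = ||X_n - C_ij||: point n vs. cluster j of antibody i
    dists = np.linalg.norm(
        data[None, :, None, :] - population[:, None, :, :], axis=-1)
    nearest = dists.min(axis=2)        # step (ii): closest centroid per point
    sed = nearest.sum(axis=1)          # D_i = SED_i, Eq. (4)
    return 1.0 / (1.0 + sed), sed      # Ab_i, Eq. (3)
```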

Step 4: Clone the population to produce \(N_{c}\) clones

Every antibody in the population is cloned, i.e., copied, \(N_{c}\) times to generate a new population with \(P\times N_c \) antibodies.
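In the array representation used above, this step is a single repeat (sketch):

```python
# population: shape (P, K, d)  ->  clones: shape (P * Nc, K, d);
# rows i*Nc .. i*Nc + Nc - 1 are identical copies of antibody i.
clones = np.repeat(population, Nc, axis=0)
```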

Step 5: Retain the original population and maturate affinity of remaining antibodies

In this step, the original population is retained first, and then the affinity of the remaining antibodies is maturated, to prevent the mutated antibodies from being worse than their pre-mutation counterparts. The mutation steps are as follows:

(i) Calculate the mutation rate. The mutation rate is related to the fitness value: antibodies with higher affinity have a smaller mutation rate, as given by Eq. (5):

    $$\begin{aligned} \text{Mutation rate}=\alpha _{i} =\left( \frac{1}{\rho }\right) e^{-\left( {\text{Ab}^{*}} \right) _i }, \end{aligned}$$
    (5)

    where \(\left( {\text{Ab}^{*}} \right) _i =\text{Ab}_i /\text{Ab}_\mathrm{max}\) is the normalized affinity value, \(\rho \) is a constant, and \(i\in \left( {P\times N_c} \right) -P\), i.e., mutation applies only to the clones, not to the retained originals.

(ii) Maturate the affinity as shown in Eq. (6):

    $$\begin{aligned} c^{{\prime }}=c+\alpha _i \times N\left( {0,1} \right) , \end{aligned}$$
    (6)

    where \({c}^{{\prime }}\) is the antibody string after affinity maturation, \({c}=({{c}}_1 ,{{c}}_2 ,\ldots ,{{c}}_{{x}} )\) is the cloned antibody string, \(\alpha _i\) is the mutation rate, and \(N\left( {0,1} \right) \) is the standard normal distribution with mean 0 and standard deviation 1.
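Combining Eqs. (5) and (6), one possible implementation reads as follows (our sketch; the choice of \(\rho\) and the handling of the retained originals follow the definitions above):

```python
import numpy as np

def maturate(clones, ab, rho=5.0, rng=np.random.default_rng(0)):
    """Affinity maturation, Eqs. (5)-(6).

    clones: array (P*Nc, K, d); ab: their fitness values, shape (P*Nc,).
    The retained original copies should be excluded by the caller, since
    mutation applies only to i in (P*Nc) - P.
    """
    ab_norm = ab / ab.max()                    # (Ab*)_i, normalized affinity
    alpha = (1.0 / rho) * np.exp(-ab_norm)     # Eq. (5): fitter => smaller rate
    noise = rng.standard_normal(clones.shape)  # N(0, 1) per coordinate
    return clones + alpha[:, None, None] * noise   # Eq. (6)
```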

Step 6: Perform one-iteration K-means and obtain new antibody cluster

The calculation steps of the one-iteration K-means are as follows:

(i) Calculate the Euclidean distance from each data point \(X\) in each antibody to the centroid of each cluster, as shown in Eq. (7):

    $$\begin{aligned} d(X_i ,C_{ij} )=\left\| {X_i -C_{ij} } \right\| . \end{aligned}$$
    (7)
(ii) In each antibody, each data point \(X\) is assigned to its nearest cluster centroid and belongs to that cluster.

(iii) Calculate the new cluster centroid vector in each antibody to obtain the new antibody cluster, as shown in Eq. (8):

    $$\begin{aligned} C_{ij}^\mathrm{new} =\frac{1}{n_{ij}}\sum _{\forall X_i \in n_{ij} } X_i , \end{aligned}$$
    (8)

    where \(C_{ij}^\mathrm{new} \) is the new centroid vector of the \(j\)th cluster in the \(i\)th antibody, and \(n_{ij} \) is the total number of data points of the \(j\)th cluster in the \(i\)th antibody.
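A minimal sketch of this single K-means pass, using the same conventions as the earlier snippets:

```python
import numpy as np

def kmeans_one_iter(population, data):
    """One K-means update per antibody (Eqs. 7-8)."""
    P, K, d = population.shape
    new_pop = population.copy()
    for i in range(P):
        # Eq. (7): distance from every point to every centroid of antibody i
        dists = np.linalg.norm(
            data[:, None, :] - population[i][None, :, :], axis=-1)
        labels = dists.argmin(axis=1)      # step (ii): nearest centroid
        for j in range(K):
            members = data[labels == j]
            if len(members):               # Eq. (8): mean of assigned points
                new_pop[i, j] = members.mean(axis=0)
    return new_pop
```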

Step 7: Recalculate the fitness value of new antibodies

The cluster centroids in the antibodies may change after affinity maturation and the one-iteration K-means. Therefore, the fitness value of each new antibody should be calculated before selecting the optimal antibody. The calculation of the fitness value is the same as in Step 3.

Step 8: Select the optimal antibody in each group to replace original population

In this step, there are \(P\) groups of new antibodies, and each group has \(N_{c}\) clones. To obtain the optimal antibody of each group, the antibodies in each group are ranked by fitness value. After ranking, the first antibody (with the highest affinity) in each group is selected to replace the original population \(P\). Note that the optimal antibody selected from one group may still be worse than antibodies in other groups.
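In code, the selection reduces to a group-wise argmax over the clone fitness values (sketch; `fitness` as defined in the Step 3 snippet):

```python
ab, _ = fitness(clones, data)               # fitness of all P * Nc clones
best = ab.reshape(P, Nc).argmax(axis=1)     # index of best clone per group
population = clones.reshape(P, Nc, K, d)[np.arange(P), best]
```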

Step 9: Determine whether the average error differs between two generations

Check whether the average errors of two consecutive generations differ. If the absolute difference is greater than the error threshold, there is still a difference between the two generations and the population has not been fully searched; it is necessary to return to Step 4. The threshold value must be pre-determined; this study employs Taguchi parameter design to obtain it in the subsequent section. The calculation of the population error is shown in Eq. (9):

$$\begin{aligned} \text{Average population error}=\sum _{i=1}^{P} \frac{\text{SED}_i }{n}, \end{aligned}$$
(9)

where \(n\) is the number of antibodies in the population of the current generation.
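A sketch of the convergence check of Eq. (9), assuming `sed` holds the SED values of the current population as returned by the Step 3 snippet:

```python
avg_error = sed.sum() / len(sed)            # Eq. (9), with n = len(sed)
# If the change versus the previous generation exceeds the threshold,
# the search has not settled yet: return to Step 4 (cloning).
converged = abs(avg_error - prev_avg_error) <= error_threshold
```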

Step 10: Autogenous suppression

Autogenous suppression expands the search scope by deleting excessively similar antibodies, thereby preventing convergence to a local optimum. This study sorts the optimal antibodies selected in the previous step in descending order and then calculates the Euclidean distance between pairs of antibodies, as shown in Eq. (10). If the distance is smaller than the autogenous suppression threshold \(\sigma _{s}\), the antibody with lower affinity is deleted, and the better antibody is retained as a memory cell; the number of remaining cells \(R\) is then set to \(d\,\%\) of the number of memory cells \(M\). Thus, with autogenous suppression, the numbers of memory cells and remaining cells differ across iterations. The threshold \(\sigma _{s}\) and the percentage \(d\,\%\) are also parameters requiring determination; this study employs Taguchi parameter design to obtain their values in the subsequent sections.

$$\begin{aligned} d(C_i ,C_j )=\left\| {C_i -C_j } \right\| \end{aligned}$$
(10)
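One way to realize this suppression step (a sketch; the greedy keep-the-fitter-first strategy is our reading of the descending-order arrangement described above):

```python
import numpy as np

def suppress(population, ab, sigma_s):
    """Delete the weaker of any antibody pair closer than sigma_s (Eq. 10)."""
    order = np.argsort(-ab)                 # antibodies in descending fitness
    keep = []
    for idx in order:
        # Survives only if not too close to an already-kept, fitter antibody.
        if all(np.linalg.norm(population[idx] - population[k]) >= sigma_s
               for k in keep):
            keep.append(idx)
    keep = np.array(keep)
    return population[keep], ab[keep]
```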

Step 11: Determine whether the number of iterations is satisfied

If the preset number of iterations is reached, then stop; otherwise, go back to Step 3.

4 Computational experiences

This section applies some benchmark data sets to assess the proposed algorithm, aiNetK, in comparison with particle swarm optimization and artificial immune system-related algorithms. A detailed discussion is presented as follows.

4.1 Data sources

This section adopts four data sets provided by the Web site of the Department of Information and Computer Science, University of California (http://archive.ics.uci.edu/ml/datasets.html). Table 1 lists the corresponding information for each data set, including the data set name, number of data points, number of features, and number of clusters.

Table 1 Benchmark data sets

4.2 Taguchi experimental design

Metaheuristic performance is affected by the combination of parameter values. A full factorial analysis is very time consuming, since a very large number of cases has to be studied when more than three parameters are considered. To reduce the required number of simulations, the Taguchi method based on orthogonal parametric arrays can be used [35]. Orthogonal arrays identify only the main effects, not the interactions between the parameters. This allows an efficient screening of a large number of parameters to determine which have a larger impact on performance.

For designing the aiNet algorithm, it is necessary to determine the related parameters: the population size (\(N\)), the number of clones (\(N_{c}\)), the percentage (\(d\)) of new cells, the suppression threshold, and \(\beta \). These five parameters are used as the design factors of the Taguchi experiment, each with three levels. This study refers to parameter values in the past literature; Table 2 shows the parameter settings from the literature and the optimal parameter combination obtained from several tests in this study. Since the value of \(N_{c}\) is the same across the literature, this study slightly adjusts \(N_{c}\) to obtain a better parameter combination. The Iris data set is used as the object of the experimental combination analysis. The orthogonal array \({L}_{27} (3^{5})\) is used for the experiment, and the software package Minitab is used for the Taguchi design. The SED is applied as the criterion; a smaller SED indicates a better clustering result, so the experiment is of the smaller-the-better type. The S/N (signal-to-noise) ratio is used to evaluate the stability of system quality in the Taguchi design of experiments: a higher S/N ratio indicates less noise and better parameter quality. To sum up, the parameter settings of the proposed aiNetK algorithm and the PSO, AIS, aiNet, PSKO, and AISK algorithms are shown in Table 3.
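For reference, the smaller-the-better S/N ratio is \(\text{S/N} = -10\log _{10}\bigl (\frac{1}{n}\sum _{i=1}^{n} y_i^2\bigr )\); a short sketch (illustrative values only, not from the experiment):

```python
import numpy as np

def sn_smaller_the_better(y):
    """Taguchi S/N ratio for a smaller-the-better response (here, SED)."""
    y = np.asarray(y, dtype=float)
    return -10.0 * np.log10(np.mean(y ** 2))

# Example: SED values from replicated runs of one parameter combination.
print(sn_smaller_the_better([97.3, 98.1, 97.9]))   # higher is better
```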

Table 2 Factor levels
Table 3 Correlation parameters of the algorithms

4.3 Experimental results and testing

4.3.1 Algorithm convergence

This study presents the convergence of the different algorithms on the four benchmark data sets in Figs. 3, 4, 5 and 6. As shown in the SED convergence plots, aiNet evaluates the average fitness error and suppresses excessively similar antibodies; thus, its convergence on the four data sets is quicker than that of AIS and PSO. With the fast-converging local search of the K-means step, the proposed aiNetK algorithm converges faster and to better values than the other algorithms on all four data sets.

Fig. 3 SED convergence map of Iris data set using different algorithms

Fig. 4 SED convergence map of Wine data set using different algorithms

Fig. 5 SED convergence map of Glass data set using different algorithms

Fig. 6 SED convergence map of Breast Cancer data set using different algorithms

4.3.2 Algorithm clustering results

(1) Mean value and standard deviation of SED

This study runs each algorithm on each benchmark data set 30 times and takes the average value as the test result, as shown in Table 4. As seen, on most benchmark data sets, both the SED and the standard deviation obtained by the aiNetK algorithm are the lowest.

Table 4 Clustering results of algorithms
(2) Computational time

Regarding computational time, since aiNet evaluates the average fitness error, the time needed by the proposed aiNetK is longer than that of the other algorithms. The computational time for each algorithm is listed in Table 5.

Table 5 Computational time (seconds) of each algorithm in each benchmark data set
(3) Required number of iterations for convergence

To compare the convergence speed of the algorithms, the average SED of each benchmark data set is used as a threshold value. Each algorithm is stopped and its iteration number recorded when the threshold is reached; conversely, if an algorithm does not reach the threshold within 2,000 iterations, convergence is regarded as failed. This study runs each algorithm 10 times on each benchmark data set, and the average number of iterations to convergence is shown in Table 6. As shown, PSO and AIS compute quickly but have weaker search capability, so on most data sets they cannot reach the threshold value. The aiNetK needs longer computational time but has better search capability, reaching the convergence threshold in the fewest iterations.

Table 6 Average convergence times of each algorithm in each benchmark data set
(4) Accuracy

The accuracy of the clustering result of each algorithm on the four benchmark data sets is shown in Table 7. It indicates that the accuracy of the proposed aiNetK is the highest among all algorithms on all four data sets.

Table 7 Accuracy of clustering result of each algorithm in each benchmark data set

4.3.3 Statistical test

At a confidence level of 95 % (\(\alpha =0.05\)), it is tested whether the difference between aiNetK and each of the other five algorithms is significant. Let \(\mu _{\mathrm{aiNetK}} \), \(\mu _{\mathrm{PSO}} \), \(\mu _{\mathrm{PSKO}} \), \(\mu _{\mathrm{AIS}} \), \(\mu _{\mathrm{AISK}} \), and \(\mu _{\mathrm{aiNet}} \) denote the average SED values of aiNetK, PSO, PSKO, AIS, AISK, and aiNet, respectively. Pairwise comparisons are then conducted.

The test statistic (\(Z_0 )\) is as follows:

$$\begin{aligned} Z_0 =\frac{\left( {\overline{X_1 } -\overline{X_2 } } \right) -\left( {\mu _1 -\mu _2 } \right) }{\sqrt{\frac{S_1^2 }{n_1 }+\frac{S_2^2 }{n_2 }}}. \end{aligned}$$
(11)

A one-tailed (left-tail) test was conducted. If the test statistic \(Z_0 <-Z_{\alpha } \), then \({H}_0 \) is rejected. Table 8 shows the test results of aiNetK against the other algorithms on the four data sets.
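Eq. (11) and the rejection rule can be scripted directly (sketch; the sample statistics would be the 30-run means and standard deviations reported above):

```python
import math

def z_test(x1_bar, x2_bar, s1, s2, n1=30, n2=30):
    """Two-sample Z statistic of Eq. (11); reject H0 if Z0 < -Z_alpha."""
    z0 = (x1_bar - x2_bar) / math.sqrt(s1**2 / n1 + s2**2 / n2)
    z_alpha = 1.645                      # Z_{0.05} for the left-tail test
    return z0, z0 < -z_alpha             # (statistic, reject H0?)
```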

Table 8 Test of aiNetK against the other algorithms in the four data sets

In the statistical tests, aiNetK is significantly better than the other five algorithms on the benchmark data sets, except versus the PSKO algorithm on the Wine data set. The average computational time of aiNetK is the longest, since aiNetK integrates two algorithms and its structure is more complex than that of the other algorithms. However, aiNetK converges in fewer iterations and reaches the lowest average SED value. The convergence speed, convergence result, and clustering accuracy of aiNetK are all better than those of the other algorithms.

5 Conclusions

This study has proposed a novel cluster analysis technique, the aiNetK algorithm, which integrates the aiNet and K-means algorithms to improve clustering performance. The computational results on four benchmark data sets reveal that it attains the lowest SED value compared with the other clustering algorithms. In the future, the current idea can be extended to an automatic clustering algorithm that does not require the number of clusters in advance. Additionally, other metaheuristics, such as the bee colony optimization algorithm, can also be applied.