1 Introduction

As an unsupervised classification process, clustering [1] does not require sample labels as prior knowledge; it produces a reasonable partition using only the degree of similarity between samples. Clustering focuses on finding the underlying structure of the samples and groups similar samples into the same cluster.

The artificial immune system [2, 3] is one of the important achievements in the field of artificial intelligence. It is inspired by the natural immune system and has been widely applied to engineering optimization [4], intrusion detection [5], data mining and so on. There are three classical theories in artificial immune systems: clonal selection [6, 7], immune network [8, 9] and negative selection [10].

In the Spectral Clustering algorithm (SC) [11], an important clustering algorithm, the samples are regarded as vertices and the similarities between samples are regarded as weighted edges. Correspondingly, the clustering problem is converted into a graph partitioning problem. SC is a good way to deal with non-diffuse datasets. However, SC requires the number of clusters as prior knowledge. In real environments the number of clusters is usually unknown, so this method is sometimes impractical. K-means [12] is the most popular clustering algorithm, owing to its simplicity, ease of implementation and quick convergence. However, it is sensitive to the initialization of the cluster centers and tends to converge to local optima. The Fuzzy C-Means clustering algorithm (FCM) [13] is another classical algorithm. Each sample has a degree of membership in every cluster rather than belonging to just one cluster. Thus, points on the edge of a cluster may have a lower degree of membership than points in its center. This reflects the real world more objectively; however, FCM also needs the number of clusters.

Thanks to the efforts of researchers, many results have been reported on artificial immune networks [14]. The artificial immune network algorithm (aiNet) [15] and the resource limited artificial immune system (RLAIS) [16] have become the most famous models. They are able to filter redundancy and reveal the potential structure of the data. However, aiNet defines many parameters, so its computational cost is high, and it is sensitive to noise nodes. The improved artificial immune network clustering algorithm based on forbidden clone (FCAIN) [17] was proposed to improve the weak denoising ability. However, it still needs many parameters and does not shorten the running time.

This paper proposes a new artificial immune network based on the secondary immune mechanism [18, 19] (SIMAIN), which can obtain an accurate network structure and shorten the running time. A competition selection strategy is employed to guide the process and reduce the number of iterations.

The remainder of this paper is arranged as follows. Section 2 reviews the aiNet algorithm and briefly introduces the minimum spanning tree (MST) [20]. Section 3 gives the technical details of the proposed SIMAIN. Section 4 discusses experimental results in comparison with other clustering algorithms. Section 5 draws some conclusions.

2 Related Works

2.1 Artificial Immune Network Algorithm

An artificial immune network [21, 22] is divided into an antibody network and a memory network [23, 24]. The dataset to be clustered is regarded as the antigens, and the obtained network nodes are treated as antibodies. The memory network, which is made up of selected antibodies, is the basis of the immune response. When the antibody network is invaded by antigens, it updates its antibodies and adjusts the memory network.

After the learning process, the antibodies in the memory network represent internal images of the antigens. The aiNet algorithm aims at building a memory set that recognizes and represents the data structure. In general, this algorithm is universal. However, it still has some shortcomings, such as the large number of parameters and the high computational cost.

2.2 Minimum Spanning Tree

After getting the final memory network, we can get a simple network structure through the connection of the network nodes. Because the minimum spanning tree (MST) can describe and analyze the structure of the clustering network, we use the minimum spanning tree to obtain the relationship between the network nodes.

When the dimension of the network nodes is less than or equal to two, the clustering structure can be obtained from the MST directly. However, when the dimension is three or more, the distances between network nodes are read from the mapping diagram (bar chart) of the MST. If the algorithm performs well, a distance threshold is clearly visible in the bar chart and can be used to classify the network nodes.
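The threshold selection described above can be illustrated with a short Python sketch. It is not the authors' implementation: the function names (`mst_edges`, `gap_threshold`, `cut_clusters`) are hypothetical, Prim's algorithm is used as one standard way to build the MST, and the rule "cut at the midpoint of the largest jump between sorted edge lengths" is an assumed stand-in for reading the threshold off the bar chart by eye.

```python
import math


def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (i, j, dist) edges."""
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known distance from the tree to each node
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the cheapest node that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def gap_threshold(edges):
    """Assumed heuristic: midpoint of the largest jump in the sorted edge lengths."""
    lengths = sorted(e[2] for e in edges)
    if len(lengths) < 2:
        raise ValueError("need at least two MST edges to look for a gap")
    gaps = [b - a for a, b in zip(lengths, lengths[1:])]
    k = max(range(len(gaps)), key=gaps.__getitem__)
    return (lengths[k] + lengths[k + 1]) / 2


def cut_clusters(n, edges, threshold):
    """Drop MST edges longer than the threshold and label the connected components."""
    label = list(range(n))

    def find(x):
        while label[x] != x:
            label[x] = label[label[x]]   # path halving
            x = label[x]
        return x

    for i, j, d in edges:
        if d <= threshold:
            label[find(i)] = find(j)
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


if __name__ == "__main__":
    nodes = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    tree = mst_edges(nodes)
    print(cut_clusters(len(nodes), tree, gap_threshold(tree)))
```

When the network nodes form well-separated groups, the sorted edge lengths show one large jump, and cutting there splits the tree into one component per group. When the jump is not pronounced, this heuristic is unreliable, which matches the remark that the threshold is only obvious if the algorithm performs well.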

3 The Proposed SIMAIN Algorithm

3.1 Secondary Immune Mechanism

To cluster the datasets automatically and efficiently, an improved artificial immune network [25, 26] algorithm based on secondary immune mechanism is proposed.

The proposed algorithm follows the principles of the immune mechanism, especially immune memory and the secondary response. In the immune system, when antigens invade the body, antibodies are produced to recognize them. When antigens of the same type invade the body again, the existing antibodies can recognize them, and the memory cells respond quickly and secrete antibodies to remove the antigens rapidly through immune memory. This process is known as the secondary immune response. We name this mechanism the Secondary Immune Mechanism (SIM).

In our algorithm, the clone operator and the mutation operator are replaced by a competition selection operator and a competition selection strategy. The clone operator is used to clone antibodies with high affinity and increase the ability to search for the optimal solution, and the mutation operator is used to expand the search space. However, these evolutionary operators need multiple iterations and lead to low efficiency. In contrast, our competition selection strategy recognizes antigens quickly by choosing antibodies with high affinity, and increases the ability to identify antigens through the stimulation degree. The higher the stimulation degree, the better the ability to identify the antigens. We can then obtain the memory network simply by selecting the antibodies with a high stimulation degree.

The stimulation degree is inspired by the resource limited artificial immune system; it can identify noise nodes effectively and acquire an accurate structure of the datasets. The stimulation level (SL) reveals the degree to which an artificial recognition ball (ARB) is stimulated by antigens. An ARB with a higher stimulation level acquires more resources, so its survival rate is high; on the contrary, an ARB with a lower stimulation level is eliminated for lack of resources. The lower the antigen density around an ARB, the lower its stimulation level, so the eliminated ARBs are the noise nodes. Our algorithm is useful not only for artificial datasets with a simple structure, but also for complex datasets and real-world datasets. Memory cells in the final memory network know almost all specificities of the antigens after two iterations. So, with the help of the secondary immune mechanism, the running time is greatly reduced.

3.2 The Introduction of SIMAIN

The SIMAIN algorithm is described step by step below (Fig. 1).

Fig. 1. Flow chart of SIMAIN

(1) Set the recognition threshold l and initialize the network nodes Ab: The affinity recognition threshold l between antibodies and antigens is set to a reasonable value. If the affinity is higher than l, the antibody can recognize the antigen; otherwise it cannot. The network Ab, which has the same number of columns as Ag, is generated randomly.

(2) Choose \( Ag_{i} \) from Ag to invade Ab: An antigen \( Ag_{i} \), selected randomly from Ag, invades the antibody network Ab.

(3) Calculate the affinity: Calculate the affinity \( f_{ij} (j = 1,2, \ldots ,N_{ab} ) \) between \( Ag_{i} \) and each antibody in the current network Ab, which is based on the distance \( D_{ij} \) as follows:

$$ D_{1,2} = \sqrt {\sum\limits_{d = 1}^{m} {(x_{1d} - x_{2d} )^{2} } } = \left\| {x_{1} - x_{2} } \right\| $$
(1)
$$ D_{ij} = \left\| {Ag_{i} - Ab_{j} } \right\| $$
(2)
$$ f_{ij} = 1/D_{ij} $$
(3)

where \( N_{ab} \) is the number of antibodies in the current network Ab, \( D_{1,2} \) is the Euclidean distance between two samples \( x_{1} \) and \( x_{2} \), m is the dimension of the samples, and d indexes the dimensions. When \( D_{ij} \) is equal to zero, \( f_{ij} \) is taken to be infinite. For any two nodes, the smaller their distance, the greater their affinity.

(4) Is the affinity higher than l?

  (a) Put each antibody \( Ab_{j} \) whose affinity is higher than l into the memory network M as a memory cell, and then increase its stimulation level N by one;

$$ M \leftarrow [M;Ab_{j} ] $$
(4)
$$ N = N + 1 $$
(5)

  (b) Add the antigen whose affinity is lower than l to the antibody network Ab as an antibody;

$$ Ab \leftarrow [Ab;Ag_{i} ] $$
(6)

(5) Have all antigens in Ag invaded?: Ensure that all antigens have invaded the antibody network Ab.

(6) Has the secondary immune response finished?: Ensure that the process has completed the second cycle, namely the secondary invasion.

(7) Competition selection: Rank the network nodes according to their stimulation levels N and select the top n% of the network nodes.

(8) Network suppression: Eliminate the antibody nodes whose mutual affinity is higher than the recognition threshold l1, until no two antibody nodes in the memory network M can recognize each other. The recognition threshold controls the specificity level of the antibodies, the clustering accuracy and the network plasticity.

  (a) Calculate the affinity \( f_{ik} \) among all the antibody nodes in the memory network M.

$$ f_{ik} = \frac{1}{{\left\| {M_{i} - M_{k} } \right\|}},M_{i} \in M,M_{k} \in M,\forall i,k $$
(7)

  (b) Eliminate the antibody nodes in the memory network M whose \( f_{ik} \) is higher than l1, where l1 is the affinity recognition threshold between antibody nodes.

(9) Construct the MST: Construct the minimum spanning tree from the network nodes in the memory network.

  (a) After the steps above, a collection of antibody nodes in the memory network \( M = \{ M_{1} ,M_{2} , \ldots ,M_{m} \} \) is obtained, where m is the number of antibody nodes.

  (b) Construct a complete graph:

$$ G = (M,D) $$
(8)
$$ D = \{ D(M_{i} ,M_{j} ) \mid D(M_{i} ,M_{j} ) = \left\| {M_{i} - M_{j} } \right\|, i,j \in [1, m]\} $$
(9)

  (c) Construct the MST and draw the bar chart of the distances between adjacent network nodes.

(10) Cut branches of the forest: Obtain the threshold that separates the categories, then cut the branches according to this threshold.

(11) Output the clustering result.
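Steps (1) to (8) can be summarized in a minimal Python sketch. It is an illustration under stated assumptions rather than the authors' MATLAB code. The function names, the `n_init` and `seed` parameters, and the use of a fraction `top_fraction` in place of n% are hypothetical. The sketch further assumes that initial antibodies are drawn uniformly from the bounding box of the data, that an antigen joins Ab only when no antibody recognizes it, that each antibody is stored in the memory network once with an accumulated stimulation level, and that suppression keeps the more stimulated node of a mutually recognizing pair.

```python
import math
import random


def affinity(a, b):
    """Eq. (3): reciprocal Euclidean distance, infinite for identical points."""
    d = math.dist(a, b)
    return math.inf if d == 0 else 1.0 / d


def simain_memory(antigens, l, l1, top_fraction, n_init=5, seed=0):
    """Return the memory-network nodes after two invasion passes (steps 1-8)."""
    rng = random.Random(seed)
    dim = len(antigens[0])
    lo = [min(p[k] for p in antigens) for k in range(dim)]
    hi = [max(p[k] for p in antigens) for k in range(dim)]
    # Step 1: random antibodies with as many columns as the antigens (assumed range).
    ab = [tuple(rng.uniform(lo[k], hi[k]) for k in range(dim)) for _ in range(n_init)]
    stim = {}  # antibody index -> stimulation level N; the keys form the memory network M

    # Steps 2-6: every antigen invades the network, and the whole pass runs twice.
    for _ in range(2):
        order = list(range(len(antigens)))
        rng.shuffle(order)
        for i in order:
            ag = tuple(antigens[i])
            hits = [j for j, node in enumerate(ab) if affinity(ag, node) > l]
            for j in hits:           # step 4(a): memorise the antibody, N = N + 1
                stim[j] = stim.get(j, 0) + 1
            if not hits:             # step 4(b): unrecognised antigen joins Ab
                ab.append(ag)

    # Step 7: competition selection keeps the most stimulated fraction of M.
    ranked = sorted(stim, key=stim.get, reverse=True)
    keep = ranked[:max(1, round(top_fraction * len(ranked)))]

    # Step 8: network suppression, visiting nodes from most to least stimulated.
    memory = []
    for j in keep:
        if all(affinity(ab[j], m) <= l1 for m in memory):
            memory.append(ab[j])
    return memory


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
            (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
            (20.0, 20.0)]  # two dense groups and one isolated point
    print(simain_memory(data, l=1.0, l1=1.0, top_fraction=0.67, n_init=0))
```

In this sketch an antigen that joined Ab during the first pass is recognized by itself and by its neighbours during the second pass, so nodes in dense regions accumulate a high stimulation level while isolated nodes stay near one. This is how the second pass lets the competition selection step discard likely noise nodes.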

4 Experimental Results and Discussions

This section presents comparative experiments and their results. The proposed SIMAIN algorithm is compared with several algorithms: K-means [27], FCM, SC, aiNet and FCAIN. These algorithms were coded in MATLAB R2013b. The simulations were carried out on a personal computer with an Intel(R) M 370 2.4 GHz processor, 6 GB RAM, and Windows 7.

4.1 Experimental Datasets

In order to verify the clustering performance of the proposed SIMAIN, two real-world datasets and seven artificial datasets are used. The real-world datasets are from the UCI repository. To avoid unstable experimental results, each algorithm is run 30 times on each dataset and the results are averaged. The variance indicates the level of stability.

These artificial datasets represent different types. Sticks and Spiral are non-convex. AD_20_2 has a spherical distribution. Sizes5 is diffuse. Data9 is three-dimensional. Data18 is 18-dimensional with a Gaussian distribution. Data100 is 100-dimensional, also with a Gaussian distribution. More details about the real-world and artificial datasets are given in Table 1.

Table 1. The details about datasets.

4.2 Parameter Setting

For K-means, FCM and SC, the number of clusters is given in advance, and the scale parameter is specified for SC. For FCAIN, the forbidden clone threshold is initialized.

Compared with FCAIN and aiNet, the SIMAIN algorithm does not need many parameters or a large number of iterations. We only need to define the natural death threshold l, the suppression threshold l1, and the stimulation degree. Thus, our algorithm reduces the dependence on parameters, and using only two iterations helps to shorten the running time.

4.3 Evaluation Index

To evaluate the clustering accuracy of SIMAIN, Clustering Accuracy (CA) [28] and the Adjusted Rand Index (ARI) [29] are employed. Note that the labels are used only for evaluation; the proposed algorithm does not need the labels when clustering.

CA: It is the rate of correctly labeled samples, obtained by comparing the true label of each sample with the label assigned by the clustering result. It is defined as follows, where \( n_{i} \) is the number of wrongly assigned samples that should belong to label i, k is the number of labels, and n is the number of all samples. CA takes values in the interval [0, 1]; the larger the value, the better the clustering.

$$ CA = 1 - \frac{{\sum\nolimits_{i = 1}^{k} {n_{i} } }}{n} $$
(10)
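Eq. (10) can be evaluated with a few lines of Python. The text does not say how obtained clusters are matched to true labels, so this sketch assumes that each obtained cluster is assigned the majority true label of its members; the function name `clustering_accuracy` is hypothetical. Under this assumption the value can be optimistic when many small clusters are produced, and a one-to-one matching would be a stricter alternative.

```python
from collections import Counter, defaultdict


def clustering_accuracy(true_labels, cluster_ids):
    """CA = 1 - (wrongly assigned samples) / n, with majority-vote label mapping."""
    members = defaultdict(Counter)   # cluster id -> counts of true labels inside it
    for t, c in zip(true_labels, cluster_ids):
        members[c][t] += 1
    # Samples that agree with the majority label of their cluster count as correct.
    correct = sum(counts.most_common(1)[0][1] for counts in members.values())
    return correct / len(true_labels)


if __name__ == "__main__":
    print(clustering_accuracy([0, 0, 0, 1, 1, 1], [5, 5, 5, 5, 7, 7]))
```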

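ARI: It is defined as follows, where \( n_{lk} \) is the number of samples that belong to both cluster l and cluster k (\( l \in T,k \in S \)), \( n_{l*} \) and \( n_{*k} \) are the numbers of samples in cluster l of T and in cluster k of S, T is the true partition, and S is the obtained partition. The ARI is at most 1, is close to 0 for a random partition, and can be negative when the agreement is worse than chance; the larger the value, the better the clustering.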

$$ ARI = R(T,S) = \frac{ \sum\nolimits_{lk} \binom{n_{lk}}{2} - \left[ \sum\nolimits_{l} \binom{n_{l*}}{2} \cdot \sum\nolimits_{k} \binom{n_{*k}}{2} \right] / \binom{n}{2} }{ \frac{1}{2} \left[ \sum\nolimits_{l} \binom{n_{l*}}{2} + \sum\nolimits_{k} \binom{n_{*k}}{2} \right] - \left[ \sum\nolimits_{l} \binom{n_{l*}}{2} \cdot \sum\nolimits_{k} \binom{n_{*k}}{2} \right] / \binom{n}{2} } $$
(11)
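Eq. (11) translates directly into Python using the contingency counts. The sketch below is illustrative; the function name `adjusted_rand_index` is hypothetical, at least two samples are assumed, and returning 1.0 when the denominator vanishes (for example when both partitions consist of a single cluster) is a convention chosen here, not something stated in the text.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, cluster_ids):
    """Adjusted Rand Index of Eq. (11), computed from pair counts."""
    n = len(true_labels)
    pair = Counter(zip(true_labels, cluster_ids))   # n_lk
    rows = Counter(true_labels)                     # n_l*
    cols = Counter(cluster_ids)                     # n_*k
    sum_lk = sum(comb(v, 2) for v in pair.values())
    sum_l = sum(comb(v, 2) for v in rows.values())
    sum_k = sum(comb(v, 2) for v in cols.values())
    expected = sum_l * sum_k / comb(n, 2)           # chance-level agreement
    max_index = (sum_l + sum_k) / 2
    if max_index == expected:                       # degenerate partitions (assumed convention)
        return 1.0
    return (sum_lk - expected) / (max_index - expected)


if __name__ == "__main__":
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
    print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

The second call compares two partitions that disagree on every pair of samples, which yields a negative value and shows that the index is not confined to [0, 1].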

4.4 Simulation Results and Discussions

The experimental results are visualized in Fig. 2, and more details are given in Tables 2, 3 and 4.

Fig. 2. The visualization results of some artificial datasets on the SIMAIN algorithm

Table 2. The results of clustering about CA index.
Table 3. The results of clustering about ARI index.
Table 4. The results of clustering about time.

It can be seen from Fig. 2 that our algorithm obtains good clustering results as a whole, and gets the clear cluster distribution.

From Tables 2 and 3, the proposed SIMAIN has the best clustering results on these datasets as a whole. For Sticks, Spiral, Data9, Data18, and Data100, we acquire the correct clustering results because our algorithm inherits the performance of the artificial immune network. For AD_20_2, the result of our algorithm is the best. For Sizes5 and Vote, the results of our algorithm are not the best, but they are only slightly worse than the best and much better than those of aiNet. The structure of Sizes5 is diffuse, and our algorithm can recognize its noise nodes. For Wine, our algorithm is worse than SC on the CA index but the best on the ARI index. In addition, the stability of the clustering results has been improved on some datasets. Overall, these results show that our algorithm is a clear improvement.

From Table 4, although its running time is not always the shortest, our algorithm is much faster than aiNet and FCAIN. This indicates that our algorithm greatly improves time performance.

In general, SIMAIN can not only recognize noise nodes and cluster datasets with special distributions, but also shorten the running time, which addresses a disadvantage of evolutionary algorithms.

5 Conclusions

This paper proposed an improved artificial immune network clustering algorithm based on the secondary immune mechanism. The SIMAIN algorithm introduces the stimulation level from RLAIS and the secondary immune mechanism to improve the efficiency and accuracy of data clustering. The simulation results indicate that our algorithm is good at clustering datasets with special distributions and at recognizing noise nodes effectively. It also enhances the ability to analyze datasets whose distribution boundaries are not clear. Like aiNet, the improved artificial immune network clustering algorithm does not need the number of clusters as prior knowledge. Most importantly, it reduces the number of input parameters and shortens the running time compared with aiNet and FCAIN. Therefore, it can be concluded that SIMAIN is an effective and efficient algorithm for data clustering.

In the next stage, we will use this algorithm to analyze high-dimensional or large-scale datasets.