1 Introduction

Cluster analysis is an important technique in data mining and is applied in diverse areas [1]. Fuzzy c-means (FCM) is one of the most widely applied clustering methods [2–4]: it not only yields the final clustering results, but also indicates, through the memberships, the degree to which each datum belongs to each class. FCM is therefore a valid method for complete data. In practical applications, however, many data sets suffer from incompleteness, and FCM cannot be directly applied to such incomplete data sets.

In the last decade, a number of new strategies based on existing clustering methods have been proposed to solve the problem of clustering incomplete data sets. The expectation–maximization (EM) algorithm has long been used for handling incomplete data and for probabilistic clustering. Abas [5] combined the EM algorithm with a finite mixture model to deal with incomplete data and furthermore [6] proposed a hybrid EM and particle swarm algorithm to estimate the missing data values. Lin and Su [7] combined a Bayesian classifier with the EM algorithm to process incomplete data in feature extraction.

In addition, Hathaway and Bezdek [8] proposed several specific methods for incomplete data clustering based on FCM. The whole data strategy (WDS) is one of the simplest: data with missing attributes are discarded, and FCM is applied to the remaining complete data. Another method is the partial distance strategy (PDS), which ignores missing attributes; the distances in the FCM algorithm are calculated by the partial distance proposed by Dixon [9]. In the optimal completion strategy (OCS), the estimation of the missing attributes is treated as an optimization problem, and improved estimates are obtained over the clustering iterations. In the nearest prototype strategy (NPS), the missing attributes are set to the corresponding attribute values of the nearest cluster prototype at each iteration. Di Nuovo [10] applied fuzzy clustering for incomplete data sets as a new technique in a psychological research environment. Aydilek and Arslan [11] proposed a hybrid approach that uses support vector regression and a genetic algorithm to estimate missing values and optimize the parameters of FCM.

Furthermore, Simiński [12] used marginalization and imputation to create rough fuzzy clusters for missing data: marginalization removes the tuples with missing values, while imputation fills in the missing values. Nowicki [13] presented a new approach to fuzzy classification in the case of missing data, in which rough-fuzzy sets are incorporated into logical-type neuro-fuzzy structures and a rough-neuro-fuzzy classifier is derived.

Incomplete data can also be treated as variables to be optimized in an optimization model. Dopazo and Ruiz-Tagle [14] used parameter scalarization and a logarithmic goal programming approach to establish an optimization model for missing data. Pei [15] established a fuzzy multi-attribute decision model to deal with incomplete data.

To exploit the clustering distribution and the information within the data set when handling incomplete data, Himmelspach and Conrad [16] took cluster dispersion into account and proposed a new membership degree for missing value estimation based on cluster dispersion. Zhang et al. [17] utilized the information within incomplete instances (instances with missing values) when estimating missing values. Considering that different data types need different solutions, Subasi et al. [18] used a Boolean similarity measure to determine missing binary data values. Hathaway and Bezdek [19] also proposed an FCM clustering method based on the triangle inequality for clustering incomplete relational data.

Recently, nearest-neighbor methods [20, 21] have been used increasingly to handle incomplete data. Doquire and Verleysen [22] used a nearest-neighbor method based on mutual information to estimate missing data values. Van Hulse and Khoshgoftaar [23] used the complete data, together with incomplete data whose missing values had already been estimated, to search for the K nearest-neighbors of each incomplete datum; missing attributes are then replaced by the mean values of the neighbors’ corresponding attributes.

In most of the above approaches, missing attributes are represented by single numerical values. Because of the uncertainty of missing attributes, replacing missing attribute values with intervals can improve the robustness of the estimation. In the nearest-neighbor interval (NNI) method [24], the nearest-neighbors of missing attributes are used to determine missing attribute intervals, which transforms the incomplete data set into an interval-valued one; a gradient-based interval-valued FCM then completes the clustering. However, the restriction that missing attributes and their nearest-neighbors should lie in the same cluster is not taken into account when the intervals are determined.

In this paper, a particle swarm optimization (PSO) and FCM hybrid algorithm based on neighbor interval reconstruction (NIR) is proposed for incomplete data. First, PDS–FCM is used to pre-classify the incomplete data set, and the nearest-neighbors of each incomplete datum are found using the attribute distribution information. According to the pre-classification results, the nearest-neighbors that lie in a different cluster from the incomplete datum are removed, and the remaining congeneric neighbors are used to determine the missing attribute intervals; this prevents the interval endpoints from being determined by information from a different class and thus improves the accuracy of the missing attribute imputation. The incomplete data set is thereby transformed into an interval-valued data set. Second, the proposed NIR hybrid algorithm is used to cluster the interval-valued data set: particles encode the cluster prototypes, while memberships are still obtained by the gradient-based alternating iterative formula. The NIR hybrid algorithm thereby improves the clustering performance.

The remaining parts of this paper are organized as follows. Section 2 presents the reconstruction of nearest-neighbor intervals for missing attributes. The PSO and FCM hybrid algorithm for incomplete data clustering is introduced in Sect. 3. Section 4 presents clustering results on several UCI data sets and a comparative study of our proposed algorithm against various other methods for handling missing values in FCM. Finally, conclusions are drawn in Sect. 5.

2 Interval estimation

We use nearest-neighbor intervals to represent missing attributes. The partial Euclidean distance is used to calculate the distance between data in an incomplete data set. According to this distance, the q nearest-neighbors [24] of an incomplete datum can be selected; their corresponding attributes must be complete. The minimum and maximum of the neighbors’ corresponding attribute values then determine the value range of the incomplete datum’s missing attribute.

Let \( \tilde{X} = \left\{ {\tilde{x}_{1} ,\,\tilde{x}_{2} ,\, \ldots ,\,\tilde{x}_{n} } \right\} \) be an s-dimensional incomplete data set that contains at least one incomplete datum with some (but not all) attribute values missing. The partial distance [9] between \( \tilde{x}_{b} \) and an instance \( \tilde{x}_{p} \) (incomplete or complete) is calculated by formula (1):

$$ D_{pb} = \frac{1}{{\sum\nolimits_{j = 1}^{s} {I_{j} } }}\sum\limits_{j = 1}^{s} {\left( {\tilde{x}_{jb} - \tilde{x}_{jp} } \right)^{2} I_{j} } $$
(1)

where \( \tilde{x}_{jb} \) and \( \tilde{x}_{jp} \) are the jth attributes of \( \tilde{x}_{b} \) and \( \tilde{x}_{p} \), respectively, and the indicator \( I_{j} \) takes the value 0 or 1: if both \( \tilde{x}_{jb} \) and \( \tilde{x}_{jp} \) are non-missing, \( I_{j} = 1 \); otherwise, \( I_{j} = 0 \).
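
As an illustration, the following is a minimal sketch of the partial distance (1), assuming missing attributes are encoded as NaN; the function name and this encoding are ours, not part of the original method:

```python
import numpy as np

def partial_distance(xb, xp):
    """Squared partial distance D_pb between two s-dimensional vectors
    that may contain NaN for missing attributes (formula (1))."""
    xb, xp = np.asarray(xb, dtype=float), np.asarray(xp, dtype=float)
    present = ~(np.isnan(xb) | np.isnan(xp))   # I_j = 1 where both values exist
    if not present.any():                       # no shared attributes
        return np.inf
    diff = xb[present] - xp[present]
    return float(diff @ diff) / present.sum()   # (1/sum I_j) sum (x_jb - x_jp)^2 I_j
```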

The partial distance calculation makes use of the information in the complete data and in the known attributes of the incomplete data. A missing attribute \( \tilde{x}_{jb} \) can be represented by its corresponding nearest-neighbor interval [\( x_{jb}^{ - } ,\;x_{jb}^{ + } \)], and a non-missing attribute \( \tilde{x}_{jw} \) can be rewritten in interval form [\( x_{jw}^{ - } ,\;x_{jw}^{ + } \)] with \( x_{jw}^{ - } = x_{jw}^{ + } = \tilde{x}_{jw} \).

However, not all neighbors necessarily lie in the same cluster as the incomplete datum. If the interval endpoints are determined by the corresponding attributes of neighbors of a different species, an obviously unreasonable interval results, as shown in Fig. 1, where the circle sample in cluster 1 is an incomplete datum with missing attributes and the dashed frame encloses its nearest-neighbors. It can be seen that the nearest-neighbors contain samples from cluster 2.

Fig. 1 The interval estimation with different species

To solve this problem, we adopt PDS to pre-classify the incomplete data set. Whether the q nearest-neighbors contain data of a different species can then be judged from the pre-classification results. If they do, the different species data are removed, and the remaining congeneric neighbors are used to determine the endpoints of the missing attribute intervals, as sketched below.
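
The following sketch shows this interval construction, reusing `partial_distance` from above; the array layout, the `labels` argument (holding the PDS–FCM pre-classification results) and the fallback when all neighbors are removed are our assumptions:

```python
def neighbor_interval(X, b, j, labels, q=6):
    """Return [x_jb^-, x_jb^+] for the missing attribute j of datum b."""
    # candidates: all other data whose j-th attribute is observed
    cand = [p for p in range(len(X)) if p != b and not np.isnan(X[p][j])]
    cand.sort(key=lambda p: partial_distance(X[b], X[p]))
    neighbors = cand[:q]                        # q nearest-neighbors by formula (1)
    # keep only congeneric neighbors: same pre-classified cluster as datum b
    same = [p for p in neighbors if labels[p] == labels[b]]
    vals = [X[p][j] for p in (same or neighbors)]  # fallback is an assumption, not in the paper
    return min(vals), max(vals)
```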

3 Particle swarm and fuzzy c-means hybrid optimization

3.1 The interval fuzzy c-means

Let \( \overline{X} = \left\{ {\overline{x}_{1} ,\overline{x}_{2} , \ldots ,\overline{x}_{n} } \right\} \) be an interval-valued data set to be partitioned into c clusters, where each datum is \( \overline{x}_{k} = \left[ {\overline{x}_{1k} ,\overline{x}_{2k} , \ldots ,\overline{x}_{sk} } \right]^{T} \) with \( \forall j,k:\overline{x}_{jk} = \left[ {x_{jk}^{ - } ,x_{jk}^{ + } } \right] \). The objective function of interval-valued FCM is:

$$ J\left( {\varvec{U},\overline{\varvec{V}} } \right) = \sum\limits_{i = 1}^{c} {\sum\limits_{k = 1}^{n} {u_{ik}^{m} } } \left\| {\overline{\varvec{x}}_{k} - \overline{\varvec{v}}_{i} } \right\|_{2}^{2} $$
(2)

with the constraint of

$$ \sum\limits_{i = 1}^{c} {u_{ik} = 1,\quad k = 1, 2, \ldots , n} $$
(3)

The interval cluster prototype matrix is \( \overline{\varvec{V}} = \left[ {\overline{v}_{ji} } \right] = \left[ {\overline{\varvec{v}}_{1} ,\overline{\varvec{v}}_{2} , \ldots ,\overline{\varvec{v}}_{c} } \right] \), where \( \overline{v}_{ji} = \left[ {v_{ji}^{ - } ,v_{ji}^{ + } } \right] \), ∀i = 1, 2, … c, j = 1, 2, … s. The Euclidean distance between \( \overline{\varvec{x}}_{k} \) and \( \overline{\varvec{v}}_{i} \) is defined as:

$$ \left\| {\overline{\varvec{x}}_{k} - \overline{\varvec{v}}_{i} } \right\|_{2} = \left[ {\left( {\varvec{x}_{k}^{ - } - \varvec{v}_{i}^{ - } } \right)^{T} \left( {\varvec{x}_{k}^{ - } - \varvec{v}_{i}^{ - } } \right) + \left( {\varvec{x}_{k}^{ + } - \varvec{v}_{i}^{ + } } \right)^{T} \left( {\varvec{x}_{k}^{ + } - \varvec{v}_{i}^{ + } } \right)} \right]^{\frac{1}{2}} $$
(4)

where \( \varvec{x}_{k}^{ - } = \left[ {x_{1k}^{ - } ,x_{2k}^{ - } , \ldots ,x_{sk}^{ - } } \right]^{T} \), \( \varvec{x}_{k}^{ + } = \left[ {x_{1k}^{ + } ,x_{2k}^{ + } , \ldots x_{sk}^{ + } } \right]^{T} \) and

$$ \varvec{v}_{i}^{ - } = \left[ {v_{1i}^{ - } ,v_{2i}^{ - } , \ldots ,v_{si}^{ - } } \right]^{T} ,\,\varvec{v}_{i}^{ + } = \left[ {v_{1i}^{ + } ,v_{2i}^{ + } , \ldots ,v_{si}^{ + } } \right]^{T} . $$
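
A minimal sketch of the interval distance (4), with the endpoint vectors stored as NumPy arrays (our layout, continuing the assumptions above):

```python
def interval_distance_sq(x_lo, x_hi, v_lo, v_hi):
    """Squared interval distance ||x_k - v_i||_2^2 of formula (4)."""
    d_lo = x_lo - v_lo                          # lower-endpoint difference
    d_hi = x_hi - v_hi                          # upper-endpoint difference
    return float(d_lo @ d_lo + d_hi @ d_hi)
```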

The updating formulas of cluster prototypes are as follows:

$$ \varvec{v}_{i}^{ - } = \frac{{\sum\nolimits_{k = 1}^{n} {u_{ik}^{m} \varvec{x}_{k}^{ - } } }}{{\sum\nolimits_{k = 1}^{n} {u_{ik}^{m} } }},\quad i = 1,2, \ldots ,c $$
(5)
$$ \varvec{v}_{i}^{ + } = \frac{{\sum\nolimits_{k = 1}^{n} {u_{ik}^{m} \varvec{x}_{k}^{ + } } }}{{\sum\nolimits_{k = 1}^{n} {u_{ik}^{m} } }},\quad i = 1,2, \ldots ,c $$
(6)
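
The updates (5) and (6) are fuzzy weighted means of the interval endpoints; a sketch under the same array layout (U of shape c × n, X_lo and X_hi of shape n × s):

```python
def update_prototypes(U, X_lo, X_hi, m=2.0):
    """Formulas (5) and (6): fuzzy weighted means of the interval endpoints."""
    W = U ** m                                      # u_ik^m, shape (c, n)
    denom = W.sum(axis=1, keepdims=True)            # sum_k u_ik^m
    return (W @ X_lo) / denom, (W @ X_hi) / denom   # V_lo, V_hi, each (c, s)
```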

If \( \exists k,h,1 \le k \le n,1 \le h \le c \) such that \( \forall j:\overline{x}_{jk} \subseteq \overline{v}_{jh} \), that is, \( \overline{x}_{k} \) lies within the convex hyper-polyhedron formed by \( \overline{v}_{h} \), then \( \overline{x}_{k} \) can be considered to belong fully to that cluster with membership 1 and to the other clusters with membership 0; thus [24]

$$ u_{ik} = \left\{ \begin{gathered} 1,\quad i = h \hfill \\ 0,\quad i \ne h \hfill \\ \end{gathered} \right.,\quad i = 1,2, \ldots ,c $$
(7)

In other cases, the membership is calculated by formula (8):

$$ u_{ik} = \left[ {\sum\limits_{t = 1}^{c} {\left( {\frac{{\left\| {\overline{\varvec{x}}_{k} - \overline{\varvec{v}}_{i} } \right\|_{2}^{2} }}{{\left\| {\overline{\varvec{x}}_{k} - \overline{\varvec{v}}_{t} } \right\|_{2}^{2} }}} \right)}^{{\frac{1}{m - 1}}} } \right]^{ - 1} ,\quad i = 1,2, \ldots ,c $$
(8)
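
A sketch combining the two membership cases (7) and (8); the zero-distance branch is our guard for the degenerate case where a datum coincides with a prototype:

```python
def update_memberships(X_lo, X_hi, V_lo, V_hi, m=2.0):
    """Memberships by formula (7) for contained data, otherwise formula (8)."""
    c, n = len(V_lo), len(X_lo)
    U = np.zeros((c, n))
    for k in range(n):
        d2 = np.array([interval_distance_sq(X_lo[k], X_hi[k], V_lo[i], V_hi[i])
                       for i in range(c)])
        inside = [i for i in range(c)               # x_jk within v_jh for all j
                  if np.all(V_lo[i] <= X_lo[k]) and np.all(X_hi[k] <= V_hi[i])]
        if inside:
            U[inside[0], k] = 1.0                   # formula (7): crisp membership
        elif np.any(d2 == 0.0):
            U[np.argmin(d2), k] = 1.0               # degenerate case (our assumption)
        else:
            w = d2 ** (-1.0 / (m - 1.0))            # formula (8)
            U[:, k] = w / w.sum()
    return U
```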

FCM yields the final clustering results and, in addition, the degree to which each datum belongs to each cluster according to the memberships, which provides richer clustering information. However, the memberships and cluster prototypes must be initialized first, so the algorithm depends strongly on the initial selection. In addition, the memberships of FCM are calculated by a gradient-based descent mechanism, so the algorithm is easily trapped in local optima. To solve these problems, we introduce a swarm optimization strategy.

3.2 The hybrid optimization algorithm

PSO is a stochastic global optimization tool [25–27] that can be used to search for the globally optimal cluster prototypes of FCM. The PSO algorithm starts with a population of particles whose positions represent potential solutions to the studied problem and whose velocities are randomly initialized in the search space. The performance of each particle (i.e., how close the particle is to the global optimum) is measured using a fitness function that depends on the optimization problem [28]. During the iterations, the search for the optimal position is performed by updating the velocities and positions of the particles. The velocity of each particle is updated using two best positions: the individual best and the global best. Each particle memorizes the best position it has encountered so far, called the individual best; the population memorizes the best among all individual best positions obtained so far, called the global best [29]. A new generation of the swarm is then produced. The swarm conducts a global search over the whole solution space during the optimization process.

A particle swarm of population size n is expressed as \( X = \left\{ {x_{1} ,x_{2} , \ldots ,x_{n} } \right\} \), where \( x_{i} \) and \( v_{i} \) denote the position and velocity of the ith particle, and \( p_{b} \) and \( g_{b} \) denote the individual best position and the global best position. The velocity and position of each particle are updated as follows:

$$ v_{i} \left( {t + 1} \right) = wv_{i} \left( t \right) + c_{1} {\text{rand}}\left( {p_{b} \left( t \right) - x_{i} \left( t \right)} \right) + c_{2} {\text{rand}}\left( {g_{b} \left( t \right) - x_{i} \left( t \right)} \right) $$
(9)
$$ x_{i} \left( {t + 1} \right) = x_{i} \left( t \right) + v_{i} \left( {t + 1} \right) $$
(10)

where w is the inertia weight, \( c_{1} \) and \( c_{2} \) are the learning factors, and rand is a randomly generated number between 0 and 1.
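
A sketch of one update step per (9) and (10); the default values of w, c1 and c2 are illustrative, not taken from the paper:

```python
def pso_step(x, v, p_best, g_best, w=0.7, c1=1.5, c2=1.5, rng=None):
    """One velocity/position update per formulas (9) and (10)."""
    rng = rng or np.random.default_rng()
    v_new = (w * v
             + c1 * rng.random(x.shape) * (p_best - x)   # cognitive pull
             + c2 * rng.random(x.shape) * (g_best - x))  # social pull
    return x + v_new, v_new
```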

One shortcoming of the classic particle swarm is that it easily falls into stagnation. During stagnation, the velocities of the particles are almost zero and the particles gather near a single point, which means that the algorithm is trapped in a local optimum [30]. A mutation is therefore introduced into the iterative process to allow the particle swarm to escape from local optima. When the particle corresponding to the global best position has not improved for A consecutive iterations, the swarm is considered to have gathered at a locally optimal location [31], and the positions of the entire swarm are mutated with a certain mutation probability. We take the mutation probability as \( p\left( t \right) = \frac{1}{{1 + t^{0.5} }} \), where t is the number of iterations. The position of a particle is changed by the following formula:

$$ x_{i} \left( t \right) = {\text{rand}} \times \left( {\hbox{max} - \hbox{min} } \right) + \hbox{min} $$
(11)

where max and min are the upper and lower bounds of the search space.
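
A sketch of this mutation; applying the probability independently per particle is our reading of the "certain variation probability":

```python
def mutate(swarm, t, lo, hi, rng=None):
    """Re-draw particle positions uniformly in [lo, hi] per formula (11)."""
    rng = rng or np.random.default_rng()
    p = 1.0 / (1.0 + t ** 0.5)                  # mutation probability p(t)
    for i in range(len(swarm)):
        if rng.random() < p:
            swarm[i] = lo + rng.random(swarm[i].shape) * (hi - lo)
    return swarm
```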

The NIR hybrid algorithm minimizes the objective function given in formula (2). Cluster prototypes are represented by the particles of the swarm, and memberships are still obtained by the gradient-based alternating iterative formula (8). New cluster prototypes and memberships are obtained by updating the velocities and positions of the particles. On this basis, the NIR hybrid algorithm proceeds as follows:

  • Step (1) Determine the missing attribute intervals:

    1. For each incomplete datum, the q nearest-neighbors are determined according to formula (1);

    2. PDS is used to pre-classify the data set, and different species data are removed from the neighbors according to the results. The interval [\( x_{jb}^{ - } ,\;x_{jb}^{ + } \)] of each attribute x jb is determined; if x jb is a non-missing attribute, then \( x_{jb}^{ - } = x_{jb}^{ + } = x_{jb} \).

  • Step (2) Initialization: cluster prototypes are represented by particles; the number of particles in the swarm and the particle dimension are determined, the velocities and positions of the particles are initialized, and the maximum number of iterations is set.

  • Step (3) The memberships are calculated using formula (8); the objective function values are calculated using formula (2). The objective function values are used as the fitness value of each particle.

  • Step (4) Update the best positions: the fitness value of each particle is compared with those of its individual best position and of the global best position the swarm has experienced; whenever it is better, the corresponding best position is replaced.

  • Step (5) Judge whether the variation condition is satisfied (the global best has not improved for A consecutive iterations); if it is, the particles are mutated using formula (11); otherwise, go to the next step.

  • Step (6) The velocities and positions of particles are updated using formula (9) and formula (10).

  • Step (7) Judge whether the termination condition is satisfied (the maximum number of iterations is reached or the error is less than a given threshold); if it is, output the final partition matrix and cluster prototypes; otherwise, go to Step (3).
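
Under the assumptions of the earlier sketches, the fitness evaluation in Step (3) could look as follows; the particle encoding as a pair (V_lo, V_hi) is our choice, not specified by the paper:

```python
def nir_fitness(particle, X_lo, X_hi, m=2.0):
    """Objective (2) for one particle encoding the interval prototypes."""
    V_lo, V_hi = particle
    U = update_memberships(X_lo, X_hi, V_lo, V_hi, m)
    J = 0.0
    for i in range(len(V_lo)):
        for k in range(len(X_lo)):
            J += U[i, k] ** m * interval_distance_sq(X_lo[k], X_hi[k],
                                                     V_lo[i], V_hi[i])
    return J                                    # lower is fitter
```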

4 Experimental results and analysis

Five data sets from the UCI database: Iris, Wine, Bupa, Haberman and Breast, are selected for the simulation experiments, and the experimental results of NIR are compared with those of five methods: WDS, PDS, OCS, NPS and NNI, thereby verifying the effectiveness of NIR. The information of the data sets is shown in Table 1.

Table 1 The information of data sets

The Iris data set contains 150 four-dimensional attribute vectors, depicting four attributes of Iris flowers, which include Petal Length, Petal Width, Sepal Length and Sepal Width. The three Iris classes involved are Setosa, Versicolor and Virginica, each containing 50 vectors.

The Wine data set contains the results of a chemical analysis of wines grown in the same region but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines. The data set contains 178 data points.

The Bupa Liver Disorder data set includes 345 samples in six-dimensional space. The first five attributes are all blood tests which are thought to be sensitive to liver disorders that might arise from excessive alcohol consumption. Each data point constitutes the record of a single male individual, and the data set has two clusters.

The Haberman data set contains cases from a study conducted between 1958 and 1970 at the University of Chicago’s Billings Hospital on the survival of patients who had undergone surgery for breast cancer. The first three attributes are patient information: the patient’s age at the time of the operation, the patient’s year of operation and the number of positive axillary nodes detected; the last attribute is the survival status (class attribute), comprising two classes.

Samples of the Breast data set were collected periodically as Dr. Wolberg reported his clinical cases; nine attributes describe the details of each case, and two additional attributes are the sample code number and the class attribute: benign or malignant.

Missing attributes are generated at random. The relevant parameters in the experiments are set as follows: q is set to 6, A is set to 10, the size of the particle swarm is 100, the number of iterations is 500 and the missing data rates are taken as 0, 5, 10, 15 and 20 %. The experimental results are shown in Tables 2, 3, 4, 5 and 6, where the optimal and sub-optimal results are marked in bold and underlined type, respectively.

Table 2 Averaged results of 30 trials using incomplete Iris data set
Table 3 Averaged results of 30 trials using incomplete Wine data set
Table 4 Averaged results of 30 trials using incomplete Bupa data set
Table 5 Averaged results of 30 trials using incomplete Haberman data set
Table 6 Averaged results of 30 trials using incomplete Breast data set

As can be seen from Tables 2, 3, 4, 5 and 6,

  1. For the test data sets under different attribute missing rates, the experimental results of NIR are in general the best, with the lowest average number of misclassifications. Only when the missing rate of Wine is 20 % is the result slightly worse than that of WDS, ranking sub-optimal. Moreover, the results of NIR and NNI are in general better than those of PDS, OCS and NPS; that is, interval estimation of missing attributes outperforms point estimation, because interval estimation improves the robustness of the missing attribute estimation. Among the interval methods, the results of NNI on the Haberman data set are slightly worse owing to the restriction noted earlier: NNI fails to take into account whether the nearest-neighbors are in the same cluster as the incomplete data when determining the intervals.

  2. The results of NIR are better than those of NNI, and the higher the missing rate, the greater the advantage of NIR in clustering accuracy. This is because the estimated intervals are restricted in NIR: neighbors of a different species are removed. In addition, the NIR hybrid algorithm is used for clustering the interval-valued data set. The hybrid algorithm utilizes the global optimization capability of the particle swarm to search for the optimal clustering results, while memberships are still obtained by the gradient-based alternating iterative formula. Compared with FCM, the hybrid algorithm avoids being trapped in local convergence and alleviates the sensitivity to initial values, thereby improving the accuracy of the results.

  3. Compared with the other methods, NIR needs more iterations on average to terminate. The NIR hybrid algorithm makes use of the global search ability of intelligent optimization to obtain better clustering results, so its efficiency is necessarily somewhat inferior to that of gradient-based search methods. The curves of the objective function versus iterations at missing rates of 0, 5, 10, 15 and 20 % with NIR on the Iris, Wine, Bupa, Haberman and Breast data sets are shown in Fig. 2.

    Fig. 2 The change curves of the objective function with iterations on each data set: a Iris, b Wine, c Bupa, d Haberman and e Breast

5 Conclusions

The NIR method removes different species information based on the pre-classification results of the incomplete data set, which makes the interval estimation of missing attributes more reasonable. The NIR hybrid algorithm is then used for interval-valued data set clustering; it utilizes the global optimization capability of the particle swarm to optimize the cluster prototypes of FCM, which yields more accurate clustering results. The experimental results show that the NIR hybrid algorithm has advantages over the other methods in accuracy and is more effective when applied to incomplete data clustering.