1 Introduction

Cluster analysis plays an indispensable role in data mining and machine learning [1,2,3,4]. It is widely used in fields such as information granulation [5,6,7], image processing [8], bioinformatics [9], and security assurance [10]. The primary task of clustering is to group a set of objects into multiple clusters so as to reveal the internal structure of massive data; objects in the same cluster are less dissimilar to each other than to objects in different clusters. Many clustering algorithms exist, and they can be roughly divided into hierarchical clustering and partition clustering [11]. In this paper, we focus on partition clustering. The k-means algorithm [12], introduced by MacQueen [13] in 1967, is a typical partition clustering method. In k-means, distance is used as the measure of similarity: the closer two objects are, the greater their similarity. Applying k-means to complete data sets is routine, but how to cluster a data set containing missing values with k-means remains an urgent problem. Therefore, in this paper, we propose an algorithm for clustering incomplete data sets.

Traditional clustering algorithms such as k-means cannot directly handle data sets containing missing values. However, in actual scenarios, some values in a data set are missing due to random noise, data loss, limitations of data acquisition, data misunderstanding, etc. A data set containing such objects with missing values is generally referred to as an incomplete data set. According to the theory proposed by Rubin et al. [20, 21], missing data can be classified as missing completely at random (MCAR), missing at random (MAR) and not missing at random (NMAR). Because of missing data, k-means cannot be applied to cluster incomplete data sets directly; how to deal with incomplete data sets is therefore a problem to be solved in cluster analysis. In this paper, we consider incomplete information systems with values missing completely at random (MCAR).

In the study of incomplete data clustering, an effective imputation method for missing values is the key to improving the accuracy of the clustering result. There are many ways to fill incomplete data, for example, mean imputation, regression imputation, multiple imputation, hot-deck and cold-deck imputation, etc. [22]. The EM algorithm [23] produces maximum likelihood estimates through iteration. The Most Common Attribute Value method [24] replaces the missing value with the most frequent attribute value. G. Doquire et al. [25] proposed a nearest-neighbor method based on mutual information to estimate missing data. J. Van Hulse et al. [26] used the mean values of the k neighbors (complete data, or incomplete data that have already been imputed) of an incomplete object to interpolate the missing data.

To solve the problem of incomplete data clustering, many scholars have proposed different clustering strategies. Hathaway and Bezdek [27] put forward four incomplete data clustering methods based on the fuzzy C-means clustering algorithm (FCM): the Whole Data Strategy (WDS), the Partial Distance Strategy (PDS), the Optimal Completion Strategy (OCS) and the Nearest Prototype Strategy (NPS). In 2003, Zhang and Chen [28] came up with several methods for clustering incomplete data using a kernel-based fuzzy c-means algorithm. On the basis of FCM, Li et al. [29] used the FCM clustering algorithm to process incomplete data under the premise that the missing data are estimated by nearest-neighbor intervals. Li et al. [30] proposed an attribute-weighted FCM algorithm for incomplete data. Li et al. [31] introduced a combination of a genetic algorithm and FCM to obtain both the clustering result and the estimated values of the missing data. Li et al. [32] studied a robust FCM clustering algorithm for incomplete data. Besides FCM-based algorithms, there are other approaches to incomplete data clustering. Su et al. [33] put forward a three-way decision clustering algorithm for incomplete data based on q-nearest neighbors. Shi et al. [34] introduced a clustering ensemble algorithm for mixed data. Recently, Mesquita et al. [35] used artificial neural networks with random weights to cluster incomplete data. All of the above results enrich the theories and models of incomplete data clustering.

This paper proposes the KM-IMI algorithm, a clustering algorithm for incomplete data sets. Since the k-means algorithm cannot directly handle incomplete data sets, sample weighting and a perturbation analysis of the cluster centroid distances are introduced to obtain a clustering result for incomplete data sets. Firstly, an incomplete data set is given, in which the objects with missing values are constrained by two conditions: (1) each original feature vector \(x_i\) retains at least one component; (2) each feature has at least one value present in the incomplete data set. Secondly, k-means is used to cluster the set of objects with non-missing values. Thirdly, for each object with missing values, the optimal imputation is sought by adding sample weights and analyzing the change of the cluster centroids. Finally, a partition clustering algorithm is used to obtain the final clustering result.

The rest of this paper is organized as follows. Section 2 reviews the k-means algorithm and the incomplete information system. Section 3 introduces the KM-IMI algorithm. Section 4 reviews clustering performance measures. Section 5 reports the experimental results, and Section 6 concludes the paper.

2 Preliminaries

To facilitate the description of the proposed method, we introduce some basic concepts related to this paper, including the k-means clustering algorithm and the incomplete information system.

2.1 K-means Clustering Algorithm

The k-means algorithm was introduced by MacQueen [13] in 1967 and is a commonly used cluster analysis method in data mining. It has been successfully applied in many fields such as computer vision, market segmentation, astronomy, agriculture and geostatistics [19]. Its advantages include: (1) the principle is uncomplicated and easy to implement; (2) it is a simple and fast classic algorithm for solving clustering problems; (3) it remains scalable and efficient on large data sets. It also has certain limitations: (1) the value of k has to be given in advance; (2) different initial cluster centroids lead to different clustering results; (3) it easily falls into local optima rather than the global optimum; (4) it is sensitive to noise and outliers. Some scholars have focused on overcoming these shortcomings. For example, Yu et al. [16] proposed two improved algorithms, mainly addressing the fact that k-means is vulnerable to outliers and noisy data and is also sensitive to the initial cluster centers. Franti et al. [17] discussed reasonable initialization and the number of iterations of k-means to improve its performance. Despite these drawbacks, k-means remains the most widely used clustering algorithm due to its simplicity and efficiency.

At present, many clustering algorithms improve upon k-means, such as k-means++ [14], k-prototypes and k-medoids [15]. Chao [18] put forward an algorithm that combines discrimination k-means with spectral clustering to improve clustering performance and handle high-dimensional problems. The standard k-means algorithm has four steps. Given a data set \(\mathbf{U}=\{x_1, x_2, \ldots , x_n\}\) of n objects with m attributes each, and the number of clusters k: (1) initialize the cluster centroids (specified or random); (2) assign each object to a cluster by the shortest-distance principle; (3) recalculate the cluster centroids; (4) repeat steps (2)-(3) until the given convergence condition is met. Algorithm 1 gives the detailed process of the k-means clustering algorithm; a minimal sketch follows it.

[Algorithm 1: the k-means clustering algorithm]
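For reference, the following is a minimal NumPy sketch of the four steps above. It is our illustration under the stated assumptions (random initialization, Euclidean distance); the function and variable names are ours, not those of Algorithm 1.

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, rng=None):
    """Minimal k-means: X is an (n, m) array, k the number of clusters."""
    rng = np.random.default_rng(rng)
    # Step 1: initialize centroids by picking k distinct objects at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each object to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when the centroids no longer move appreciably.
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return labels, centroids
```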

2.2 An Incomplete Information System

An information system, also known as a knowledge representation system, can be represented as \(S=\{U, A, V, f\}\), or abbreviated as \(S=\{U, A\}\). \(U=\{x_1, x_2, \ldots , x_n\}\) is a finite non-empty set called a universe, where n denotes the number of samples. \(A=\{a_1, a_2, \ldots , a_m\}\) is a finite non-empty set of attributes, where m represents the number of object features. \(V=\{V_1, V_2, \ldots , V_m\}\) is the set of attribute values, where \(V_i\) is the set of possible values of \(a_i\). f is an information function with \(f(x_i, a_k)=V_{ik}\in V_k\), where \(V_{ik}\) represents the value of the sample \(x_i\) on the feature \(a_k\). For example, \(x_i\) is the ith object with m features, namely \(x_i=\{x_i^1, x_i^2, \ldots , x_i^m\}\), where \(x_i^l\) \((l\le m,\ i\le n)\) represents the value of the lth feature of sample \(x_i\).

When some attribute values are missing, the information system S is called an incomplete information system. This paper mainly discusses incomplete information systems with values missing completely at random (MCAR). An example of an incomplete information system is shown in Table 1, which includes 6 samples, each with 4 attributes. Missing values are represented by *.

Table 1 An example of the incomplete information system
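For illustration, such a system can be stored as a numeric array in which the missing marker * is encoded as NaN. The values below are fabricated for demonstration; they do not reproduce Table 1's actual entries.

```python
import numpy as np

# Hypothetical incomplete information system with 6 samples and 4 attributes;
# np.nan plays the role of the missing-value marker "*".
U = np.array([
    [5.1, 3.5, np.nan, 0.2],
    [4.9, np.nan, 1.4, 0.2],
    [6.3, 3.3, 4.7, np.nan],
    [np.nan, 2.8, 4.5, 1.3],
    [6.5, 3.0, 5.2, 2.0],
    [5.9, 3.0, 5.1, 1.8],
])
is_complete = ~np.isnan(U).any(axis=1)   # rows that belong to U_W
print(is_complete)                       # only the last two rows are complete
```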

3 An Improved Mean Imputation Incomplete Data Clustering Algorithm

Suppose that \(\mathbf{U}=\{x_1, x_2, \dots , x_n\}\) is an incomplete data set with n objects. The objects with missing values are constrained by two conditions: each original feature vector \(x_i\) retains at least one component, and each feature has at least one value present in the incomplete data set. Naturally, the data set \(\mathbf{U}\) can be partitioned into two disjoint subsets. One subset, \(\mathbf{U_W}\), contains the objects with no missing values and is called the complete data set; the other subset, \(\mathbf{U_M}\), contains the objects with missing values and is called the incomplete data set, where \(\mathbf{U_W} \cup \mathbf{U_M}=\mathbf{U}\). Algorithm 1 processes the objects in \(\mathbf{U_W}\) to obtain the result \({\mathbb {C}}\). Algorithm 2 is used to fill the objects with missing values in \(\mathbf{U_M}\), obtaining the result \(\mathbb {C'}\). Finally, the final clustering result \({\mathbb {C}}_{final}\) is acquired by the k-means algorithm. The overall process of our algorithm is shown in Fig. 1.

Fig. 1 Procedure diagram of our algorithm

[Algorithm 2: KM-IMI, the improved mean imputation incomplete data clustering algorithm]

In the study of incomplete data clustering, effectively filling in the missing values of objects is the key to improving the accuracy of the clustering result. There are many methods to impute missing values, such as mean imputation, regression imputation and multiple imputation. In this section, we present an improved mean-imputation clustering algorithm for incomplete data. It can be briefly divided into four phases: (1) partitioning the data set \(\mathbf{U}\) into two disjoint subsets, the set of objects with non-missing values \(\mathbf{U_W}\) and the set of objects with missing values \(\mathbf{U_M}\); (2) clustering the objects in \(\mathbf{U_W}\) via Algorithm 1; (3) filling the objects with missing values in \(\mathbf{U_M}\) via Algorithm 2; (4) obtaining the final clustering result by k-means. The improved mean imputation incomplete data clustering algorithm based on k-means, abbreviated KM-IMI, is described in Algorithm 2; a high-level sketch of these phases is given below.
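The following sketch outlines the control flow of the four phases. It assumes the helper functions kmeans (as sketched in Sect. 2.1) and impute_object (the perturbation-based imputation sketched later in this section), with NaN as the missing marker; it is an outline under those assumptions, not the authors' reference implementation.

```python
import numpy as np

def km_imi(U, k):
    """Outline of the KM-IMI pipeline; U is an (n, m) array with NaNs."""
    complete = ~np.isnan(U).any(axis=1)
    U_W, U_M = U[complete], U[~complete]            # phase 1: split the data
    labels_W, centroids = kmeans(U_W, k)            # phase 2: cluster U_W
    clusters = [U_W[labels_W == j] for j in range(k)]
    filled = np.array([impute_object(x, clusters, centroids, k)
                       for x in U_M])               # phase 3: fill U_M
    U_full = np.vstack([U_W, filled])
    return kmeans(U_full, k)                        # phase 4: final clustering
```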

In the first phase, an incomplete data set is obtained by setting the missing rate to a value between 5\(\%\) and 30\(\%\). It is then straightforward to distinguish objects with and without missing values according to the definition in Sect. 2.2. The set of objects with non-missing values and the set of objects with missing values are denoted \(\mathbf{U_W}\) and \(\mathbf{U_M}\), respectively, where \(\mathbf{U_W} \cup \mathbf{U_M}=\mathbf{U}\).

In the second phase, Algorithm 1 is applied to the objects in \(\mathbf{U_W}\), yielding the clustering result \(\mathbb {C}_W=\{C_W^1, C_W^2, \ldots , C_W^k\}\). The objects with missing values in \(\mathbf{U_M}\) are then filled based on the clustering result \({\mathbb {C}}_W\).

In the third phase, the mean attribute value of each cluster \(C_W^i\) is used to fill the missing value, and a perturbation analysis of the cluster centroids is applied to select the optimal imputation. For example, suppose the lth attribute value of an object x in \(\mathbf{U_M}\) is missing, i.e., \(x_l=*\). Equation (1) is used to impute \(x_l\), which immediately yields k candidate imputations \(\{x_l^1, x_l^2, \ldots , x_l^k\}\), one per cluster. The imputation formula is defined as follows.

$$\begin{aligned} x_l^i = \frac{1}{|C_W^i|}\sum _{ y \in C_W^i }y_l \end{aligned}$$
(1)

where \(x_l=*\) is the lth attribute of the object \(x \in \mathbf{U_M}\), \(y_l\) is the lth attribute of an object y in the cluster, \(x_l^i\) represents the ith candidate imputation of the lth attribute of x, and \(|C_W^i|\ (i=1,2, \ldots , k)\) is the cardinality of the ith cluster.
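A sketch of Equation (1), applied to every missing attribute of an object at once. Here clusters is assumed to be the list of complete-data clusters \(C_W^1, \ldots , C_W^k\) from the second phase, and NaN marks missing values; the function name is ours.

```python
import numpy as np

def candidate_imputations(x, clusters):
    """Return, for each cluster C_W^i, a copy of x with every missing
    attribute replaced by that cluster's attribute mean (Equation (1))."""
    candidates = []
    for C in clusters:                      # C is an (n_i, m) array, no NaNs
        filled = x.copy()
        missing = np.isnan(filled)
        filled[missing] = C[:, missing].mean(axis=0)
        candidates.append(filled)
    return candidates
```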

Meanwhile, the perturbation of the cluster centroid distance is used to select the optimal imputation value. For each candidate, the filled object x is added into the corresponding cluster \(C_W^i\) with weight \(\frac{|C_W^i|}{k}\ (i=1,2,\ldots ,k)\), i.e., as \(\frac{|C_W^i|}{k}\) copies, producing the new cluster \(C_W^{i*}\). The new cluster centroid is recalculated by Equation (2). The difference between the old and new cluster centroids is then computed, and x is assigned to the cluster with the smallest difference; the corresponding candidate becomes the optimal imputation value for the object x. The formula for calculating the new cluster centroid is as follows.

$$\begin{aligned} z_i^*=\frac{1}{|C_W^{i*}|}\left( \sum _{x \in C_W^{i*}}x\right) \end{aligned}$$
(2)

where \(z_i^*\) represents the new cluster centroid and \(|C_W^{i*}|\) is the cardinality of the new cluster \(C_W^{i*}\).
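Combining Equations (1) and (2), the candidate whose weighted insertion moves its cluster centroid the least is selected. The sketch below (building on candidate_imputations above) reflects our reading of this procedure, not the authors' code; in particular, the weighted mean is computed in closed form rather than by materializing the copies.

```python
import numpy as np

def impute_object(x, clusters, centroids, k):
    """Pick the candidate imputation whose weighted insertion perturbs
    its cluster centroid the least (Equations (1) and (2))."""
    best, best_shift = None, np.inf
    for C, z, cand in zip(clusters, centroids, candidate_imputations(x, clusters)):
        w = len(C) / k                      # the object is added |C_W^i|/k times
        z_new = (C.sum(axis=0) + w * cand) / (len(C) + w)   # Equation (2)
        shift = np.linalg.norm(z_new - z)   # centroid perturbation
        if shift < best_shift:
            best, best_shift = cand, shift
    return best
```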

In the fourth phase, the incomplete data set \(\mathbf{U}\), transformed into a complete data set by the third phase, is clustered by the k-means algorithm to obtain the final clustering result \({\mathbb {C}} =\{C_1, C_2, \ldots , C_k\}\).

4 Clustering Performance Measurement

Generally, clustering performance measures are also known as "validity indices". They play a role similar to the performance metrics of supervised learning. On one side, a validity index is adopted to evaluate the clustering result; on the other side, once the validity index is determined, it can be viewed as an optimization goal during the clustering process.

Clustering performance measures can be roughly divided into two types. An external index compares the clustering result with a certain reference model, while an internal index directly examines the clustering result without using any reference model.

4.1 Accuracy

Accuracy is a frequently used and easily understood evaluation index. It represents the ratio of the number of correctly partitioned objects to the total number of samples: the greater the Accuracy, the more objects are correctly partitioned.

Definition 1

Accuracy (ACC hereafter).

$$\begin{aligned} ACC=\frac{1}{n}\sum _{i=1}^k \theta _i \end{aligned}$$

where n denotes the total number of samples in the data set, \(\theta _i\) represents the number of objects correctly assigned to the ith cluster, and k represents the number of clusters.
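The paper does not specify how the correctly partitioned objects \(\theta _i\) are counted; a common convention, assumed here, is to match predicted clusters to ground-truth classes with the Hungarian algorithm before counting hits.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def acc(y_true, y_pred, k):
    """Clustering accuracy: match clusters to classes, then count hits.
    Assumes both label sets are coded 0..k-1."""
    # Contingency matrix: cont[i, j] = objects of class j placed in cluster i.
    cont = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        cont[p, t] += 1
    rows, cols = linear_sum_assignment(-cont)   # maximize matched objects
    return cont[rows, cols].sum() / len(y_true)
```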

4.2 Davies-Bouldin Index

David L. Davies and Donald W. Bouldin [36] proposed the Davies-Bouldin Index, called DBI or DB for short. It mainly relates the distance within clusters to the distance between clusters. A smaller DBI means that clusters are farther apart and more compact within; conversely, a larger DBI means that different clusters are close together and points within the same cluster are far apart.

Definition 2

Davies-Bouldin Index (DBI hereafter).

$$\begin{aligned} DBI=\frac{1}{k}\sum _{i=1}^k \max _{j\ne i}\left( \frac{S_i+S_j}{M_{i,j}}\right) \end{aligned}$$

where k is the number of clusters, \(S_i\) and \(S_j\) represent the within-cluster scatter of clusters i and j respectively, which should be as low as possible, and \(M_{i,j}\) denotes the separation between clusters i and j, which ideally should be as large as possible. Hence, each term of the DBI is the ratio of \(S_i+S_j\) to \(M_{i,j}\), and the resulting index measures how good the clustering scheme is.
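In practice, the index can be computed directly, for example with scikit-learn's davies_bouldin_score; the data set and clustering below are chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

X = load_iris().data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(davies_bouldin_score(X, labels))  # smaller is better
```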

4.3 Silhouette Coefficient

The Silhouette Coefficient [37] was introduced by Peter J. Rousseeuw. It is a method for interpreting and validating the consistency within clusters of data, measuring how similar an object is to its own cluster compared to other clusters. The value of the Silhouette Coefficient ranges from -1 to +1, where a high value implies that the sample is well matched to its own cluster and poorly matched to neighboring clusters.

Definition 3

Silhouette Coefficient of a single object.

$$\begin{aligned} S(i)=\frac{b(i)-a(i)}{\max \{a(i),b(i)\}} \end{aligned}$$

where a(i) is the average distance between i and all other objects within the same cluster; a(i) can be interpreted as a measure of how well i is assigned to its cluster. b(i) is the smallest average distance from i to all points in any other cluster of which i is not a member; a large b(i) indicates that i is badly matched to its neighbouring cluster. Thus an S(i) close to +1 means that the object is appropriately clustered, while, by the same logic, an S(i) close to -1 means it would be more appropriate to assign the object to its neighbouring cluster.

Definition 4

Average Silhouette Coefficient (AS hereafter).

$$\begin{aligned} AS=\frac{1}{n}\sum _{i=1}^n S(i) \end{aligned}$$

where n represents the total number of objects and S(i) denotes the Silhouette Coefficient of the single object i. The average of S(i) over all points of a cluster measures how tightly grouped the points in that cluster are; thus, the average of S(i) over the entire data set measures how appropriately the data have been clustered.
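Both the per-object S(i) and the average AS are available in scikit-learn; the data set and clustering below are chosen only for demonstration.

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

X = load_iris().data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(silhouette_samples(X, labels)[:5])  # per-object S(i)
print(silhouette_score(X, labels))        # average over all objects (AS)
```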

Table 2 UCI Datasets used in the experiments
Table 3 Experimental results on UCI datasets
Table 4 Experimental results on UCI datasets
Table 5 Experimental results of ACC on UCI datasets

5 Experimental Illustration

To illustrate the effectiveness of Algorithm 2, eight UCI [38] data sets are employed in this section: Iris, Wine, Glass Identification (Glass), Wisconsin Diagnostic Breast Cancer (WDBC), Banknote, Contraceptive Method Choice (CMC), Pendigits and Page Blocks. Table 2 shows the details of these data sets. The purpose of the experiments is to verify the performance of the proposed algorithm on incomplete data. On one hand, we compare the average and best values of DBI, AS and ACC of KM-IMI against KM-CD (the k-means clustering algorithm on the corresponding complete data sets). On the other hand, to better show the superiority of the proposed method, we compare KM-IMI with OCS-FCM and NPS-FCM, two classical clustering algorithms for incomplete data. The experimental results are shown in Tables 3, 4 and 5, respectively, and a detailed analysis is given below.

Incomplete data sets were generated by randomly removing some feature values, subject to two restrictive conditions: (1) each original feature vector x retains at least one component; (2) each feature has at least one value present in the incomplete data set. That is, no object may lose all of its attribute values, and no attribute may be missing from every object. In most cases, the higher the missing rate in a data set, the lower the accuracy of the clustering results: as the missing rate grows, the accuracy of the imputation decreases, which directly degrades the performance of the clustering algorithm. Therefore, in this paper, we set the missing rate to range from 5\(\%\) to 30\(\%\).
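A sketch of how such MCAR missingness can be generated under the two constraints follows; the mechanics are our assumption, as the paper does not give its masking procedure.

```python
import numpy as np

def make_mcar(X, rate, rng=None):
    """Randomly hide about `rate` of all entries, never blanking a whole
    row or a whole column (the two constraints of Sect. 3)."""
    rng = np.random.default_rng(rng)
    X = X.astype(float).copy()
    n, m = X.shape
    target = int(rate * n * m)
    hidden = 0
    for flat in rng.permutation(n * m):     # visit entries in random order
        if hidden >= target:
            break
        r, c = divmod(flat, m)
        row_left = (~np.isnan(X[r])).sum()
        col_left = (~np.isnan(X[:, c])).sum()
        if row_left > 1 and col_left > 1:   # keep >=1 value per row/column
            X[r, c] = np.nan
            hidden += 1
    return X
```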

Tables 3 and 4 present the experimental results on the eight UCI data sets, comparing KM-IMI and KM-CD on the average and best values of DBI, AS and ACC. Underlined entries indicate cases where KM-IMI is not as good as KM-CD. We find that KM-IMI is superior to KM-CD on most data sets in the average and best values of DBI and AS, e.g., Iris, Wine, WDBC, CMC and Page Blocks. However, the average and best values on Pendigits and Page Blocks are not as good as those of KM-CD, although the gap is only between 0.01 and 0.02. One reason is that imputation accuracy decreases as the missing rate increases, which directly degrades the performance of the clustering algorithm. The average ACC does not perform well on most data sets, but the best ACC is competitive. It is worth mentioning that Banknote and CMC perform well in both the average and best values of DBI, AS and ACC. We can therefore conclude that Algorithm 2 can, to some extent, be used to fill and cluster incomplete data sets; the poor results on several data sets indicate that Algorithm 2 still needs to be further improved and perfected.

Table 5 shows the experimental results for the average and best ACC values of KM-IMI, OCS-FCM and NPS-FCM on six UCI data sets, where bold entries represent the best results. The average and best ACC values were obtained over 100 experimental runs at missing rates ranging from 5\(\%\) to 30\(\%\). OCS-FCM and NPS-FCM are two classic algorithms for clustering incomplete data sets; most FCM-based methods are implemented on the basis of [27]. Comparing the three algorithms on the average and best ACC values highlights the superiority of the proposed KM-IMI. From the results recorded in Table 5, the average and best ACC of KM-IMI are better than those of OCS-FCM and NPS-FCM on data sets such as Iris, Wine, Pendigits and Page Blocks. Only on a few data sets, such as Glass and WDBC, is KM-IMI not better than OCS-FCM, and there the difference is between 0.01 and 0.04. On analysis, it is not difficult to see that the comparison methods are based on fuzzy C-means, so it is difficult for them to obtain good results on non-spherical data sets. At the same time, on the Pendigits and Page Blocks data sets, the accuracy of our method is significantly higher than that of the comparison methods. Therefore, the KM-IMI algorithm in this paper outperforms the comparison algorithms OCS-FCM and NPS-FCM in the average and best ACC values.

In the experimental part, we compared our method not only with the original k-means algorithm but also with the OCS-FCM and NPS-FCM algorithms. On one hand, the difference between the imputed values and the actual values of the incomplete data can be observed from the clustering result. On the other hand, the effectiveness of the algorithm can be verified by the clustering performance measures. Because the missing rate affects the accuracy of the algorithm, this paper restricts the range of the missing rate. All of these results show that our algorithm performs well.

6 Conclusion

In this paper, our work focuses on how to impute and cluster incomplete data sets. An improved mean-imputation clustering algorithm for incomplete data based on the k-means algorithm is proposed. In the proposed algorithm, we cluster the objects with non-missing values and use the mean attribute value of each cluster to fill the corresponding missing values; a perturbation analysis of the cluster centroids is used to find the optimal filling value. Finally, the k-means algorithm clusters all objects, including the original objects and the filled objects. To verify the effectiveness of our algorithm, on the one hand, the average and best values of DBI, AS and ACC of KM-IMI and KM-CD are compared; on the other hand, to better demonstrate the superiority of the proposed method, we compare KM-IMI with two classical incomplete data clustering algorithms, OCS-FCM and NPS-FCM. From these experimental results, we can conclude that our algorithm is effective in filling incomplete data. However, it does not perform well on some data sets, which indicates that further improvements are needed.

There are some shortcomings in our algorithm. In future work, we will consider how to improve and refine the imputation and clustering algorithm for incomplete data, and how to combine incomplete data clustering with the three-way decision clustering algorithm.