1 Introduction

Cluster analysis [1,2,3,4,5] is an important function of data mining, and clustering algorithms are at the core of current research. Clustering divides a data set into multiple clusters or classes based on the similarity between a set of unlabeled data objects. A good clustering algorithm should produce high-quality clusters with two characteristics: (1) high intra-cluster similarity; (2) low inter-cluster similarity. The quality of the clustering results depends not only on the similarity measure and its implementation, but also on whether the method can find some or all of the hidden patterns [6].

The fuzzy C-means (FCM) clustering algorithm [7,8,9,10] is one of the most widely used fuzzy clustering algorithms and belongs to the category of fuzzy clustering algorithms based on objective functions. However, the traditional FCM algorithm has several disadvantages: the number of clusters must be set manually, and the algorithm is sensitive to the initial cluster centers. In addition, the FCM algorithm is prone to requiring many iterations, converging slowly, and getting trapped in local optima. Many algorithms have been proposed to improve FCM. Wang et al. [11] systematically improved the traditional fuzzy clustering algorithm by combining particle swarm optimization (PSO) with the fuzzy C-means algorithm; using a simple and effective particle encoding, both the best initial cluster centers and the fuzzy weighting exponent are searched during the PSO process. Li et al. [12] proposed a scheduling algorithm based on fuzzy clustering and a two-level scheduling mode. Geweniger et al. [13] combined the median c-means algorithm with the fuzzy c-means method to improve accuracy; median clustering is a powerful methodology for prototype-based clustering of similarity/dissimilarity data, but in its original variant it is applicable only to vector (metric) data. Wang et al. [14] presented a rough-set [15, 16] based measurement for the membership degree of the fuzzy C-means algorithm, taking advantage of the positive region set and the boundary region set of rough sets. Lai et al. [17] presented a rough k-means clustering algorithm that minimizes dissimilarity, solving the divergence problem of the original approaches, in which the cluster centers may not converge to their final positions. Cai et al. [18] proposed a novel initial cluster centroid selection algorithm called WLV-K-means (weighted local variance K-means), which employs the weighted local variance to measure the density of each sample, thereby finding samples of higher density, and uses an improved max-min method to select cluster centroids heuristically. Liu et al. [19] proposed to combine the FCM algorithm with the DPC (clustering by fast search and find of density peaks) algorithm: DPC is first used to automatically select the centers and number of clusters, and FCM is then used to perform the clustering; comparison experiments show that the improved FCM algorithm converges faster and achieves higher accuracy. Khan and Ahmad [20] proposed a cluster center initialization algorithm (CCIA) that clusters the samples in each dimension to find K′ (K′ > K) clusters sharing the same pattern points and obtains the center points of these K′ clusters; the data compression method of reference [21] is then used to merge the neighborhoods of high-density samples, finally yielding the K initial center points.

In this paper, we fully consider the constraints on cluster centers during cluster center selection and optimization. First, the initial cluster centers are selected by calculating the local density of each sample point. Then the selected initial cluster centers are combined with the traditional FCM algorithm to cluster the data. We therefore propose a novel fuzzy C-means clustering algorithm based on local density (LD-FCM).

The rest of this paper is organized as follows. Section 2 briefly reviews the concept of fuzzy clustering and the fuzzy C-means clustering algorithm, and states the preliminary knowledge used in our approach. Section 3 presents the proposed algorithm together with the necessary theory and analysis. Section 4 reports experimental studies that verify the effectiveness of the proposed algorithm. Section 5 concludes the paper.

2 Preliminaries

2.1 Fuzzy Cluster Analysis

The concept of fuzzy clustering was first proposed by Ruspini [22]. Fuzzy clustering combines fuzzy mathematics with clustering methods: it determines fuzzy relationships between samples by the methods of fuzzy mathematics. In other words, the clustering results are fuzzy, so that the attribution of real-world data can be described objectively from multiple angles. Fuzzy clustering analysis has therefore become one of the mainstream directions of clustering research.

Fuzzy clustering [23] calculates the similarity between data samples using some distance measure, and each sample is assigned to clusters according to this similarity. For any cluster k (1 ≤ k ≤ C), where C is the number of clusters, a data sample \( X_{i} \left( {1 \le i \le N} \right) \) (N is the number of samples) belongs to that cluster with a fuzzy membership degree, which is similar to a probability value. Fuzzy clustering produces a membership matrix \( \left[ {u_{ik} } \right] \,\left( {1 \le k \le C,1 \le i \le N} \right) \) and cluster centers \( V = \left\{ {v_{1} ,v_{2} , \cdots v_{c} } \right\} \). The membership matrix is then hardened to determine the final attribution of each data sample. The membership matrix consists of the fuzzy degrees with which each data sample belongs to each subset, and each of its elements lies in [0, 1]. In other words, if the membership degree of a sample to one subset is greater than its degree to the other subsets, the sample is more likely to belong to that subset. When \( u_{ik} = 1 \), \( x_{i} \) belongs completely to the k-th cluster; when \( u_{ik} = 0 \), \( x_{i} \) does not belong to the k-th cluster at all.
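To make the hardening step concrete, the following minimal sketch (our own illustration, not code from the cited works; the matrix values are hypothetical) assigns each sample to the cluster with its largest membership degree:

```python
import numpy as np

# Hypothetical membership matrix U with C = 2 clusters and N = 4 samples;
# each column sums to 1, as fuzzy clustering requires.
U = np.array([[0.9, 0.6, 0.2, 0.5],
              [0.1, 0.4, 0.8, 0.5]])

# Hardening: each sample goes to the cluster with the largest membership
# degree (ties resolved in favor of the lower cluster index).
labels = np.argmax(U, axis=0)
print(labels)  # [0 0 1 0]
```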

2.2 Fuzzy C-means Clustering Algorithm

The fuzzy C-means (FCM) clustering algorithm [8] is an improvement of the common C-means algorithm. The common C-means algorithm produces a hard partition of the data, while FCM produces a soft, fuzzy partition. Since much of the discussion in this paper builds on the FCM algorithm, we describe it in detail.

Suppose the data sample set \( X = \left\{ {x_{1} ,x_{2} , \ldots ,x_{n} } \right\} \subset R^{s} \) is an s-dimensional data set in Euclidean space, where n is the number of samples and each \( x_{i} \) has s dimensions, expressed as \( x_{i} = \left\{ {x_{i}^{1} , \ldots ,x_{i}^{d} , \ldots ,x_{i}^{s} } \right\} \). The FCM algorithm divides X into C classes (\( 2 \le C \le n \)) with C cluster centers \( V = \left\{ {v_{1} ,v_{2} , \ldots ,v_{c} } \right\} \). Thus, the FCM algorithm can be expressed as the following mathematical programming problem:

$$ Minimize\, J\left( {X,U,V} \right) = \sum\nolimits_{j = 1}^{n} {\sum\nolimits_{i = 1}^{c} {\left( {u_{ij} } \right)^{m} \left\| {x_{j} - v_{i} } \right\|^{2} } } $$
(1)

subject to

$$ \sum\nolimits_{i = 1}^{c} {u_{ij} = 1 \,\left( {j = 1,2, \cdots ,n} \right)} $$
(2)

where \( u_{ij} \) is the membership degree of data sample \( x_{j} \) belonging to class \( i \), and \( U = (u_{ij} )_{c \times n} \) is the fuzzy partition matrix, from which the membership degree of each data sample with respect to each cluster can be read. The similarity between the data sample \( x_{j} \) and the center of class \( i \) is measured by the Euclidean distance, recorded as \( d_{ij} = \left\| {x_{j} - v_{i} } \right\| \). \( m \) is the fuzzy weighting exponent, also known as the fuzzy factor, and mainly adjusts the degree of fuzziness of the fuzzy partition matrix.

The specific steps of the algorithm are as follows.

  • Step 1: Set the number of clusters C and the fuzzy factor \( m \) (usually 1.5 to 2.5), and initialize the membership matrix \( U^{\left( \gamma \right)} \left( {\gamma = 0} \right) \) so that it satisfies Eq. (2);

  • Step 2: The cluster center \( V^{{\left( {\gamma + 1} \right)}} = \left\{ {v_{1} ,v_{2} , \ldots ,v_{c} } \right\} \) is updated according to Eq. (3);

    $$ v_{i}^{{\left( {\gamma + 1} \right)}} = \frac{{\mathop \sum \nolimits_{j = 1}^{n} (u_{ij}^{\left( \gamma \right)} )^{m} \cdot x_{j} }}{{\mathop \sum \nolimits_{j = 1}^{n} (u_{ij}^{\left( \gamma \right)} )^{m} }},i = 1,2, \ldots ,c $$
    (3)
  • Step 3: The membership matrix \( U^{{\left( {\gamma + 1} \right)}} = (u_{ij} )_{c \times n} \) is updated according to Eq. (4);

    $$ u_{ij}^{{\left( {\gamma + 1} \right)}} = \left[ {\sum\nolimits_{k = 1}^{c} {\left( {\frac{{\left\| {x_{j} - v_{i}^{{\left( {\gamma + 1} \right)}} } \right\|}}{{\left\| {x_{j} - v_{k}^{{\left( {\gamma + 1} \right)}} } \right\|}}} \right)^{{\frac{2}{m - 1}}} } } \right]^{ - 1} ,\;i = 1,2, \ldots ,c;\;j = 1,2, \ldots ,n $$
    (4)
  • Step 4: Calculate \( e = \left\| {U^{{\left( {\gamma + 1} \right)}} - U^{\left( \gamma \right)} } \right\| \). If \( e \le \eta \) (\( \eta \) is the threshold, generally 0.001 to 0.01), the algorithm stops; otherwise, set \( \gamma = \gamma + 1 \) and go to Step 2.

  • Step 5: The samples are classified and output according to the final membership matrix U. If the sample \( x_{j} \) satisfies \( u_{ij} > u_{kj} \) for all \( k \ne i \), then \( x_{j} \) is classified into the i-th cluster, where \( u_{ij} \) represents the membership degree of the sample \( x_{j} \) to the cluster center \( v_{i} \).

The FCM algorithm [24] is a point-by-point iterative clustering algorithm that uses the sum of squared errors as its criterion function. The iteration starts from random cluster centers. To find the minimum of the objective function \( J\left( {X,U,V} \right) \), the cluster centers V and the membership matrix U are iteratively updated by Eq. (3) and Eq. (4), so the value of the objective function decreases continuously until it is minimized. Generally, the convergence condition is that the difference between the objective functions of two consecutive iterations is less than the threshold \( \eta \), or that a specified number of iterations is reached. When the objective function is minimized, the final clustering result is obtained, namely the cluster centers \( V \) and the membership matrix \( U \) of the fuzzy partition, which determine the category of each sample.
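The iteration described above can be summarized in the following NumPy sketch (a minimal illustration under our own naming, not the authors' implementation):

```python
import numpy as np

def fcm(X, C, m=2.0, eta=1e-3, max_iter=100, seed=None):
    """Minimal FCM sketch: X is an (n, s) array, C clusters, fuzzifier m > 1."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Step 1: random membership matrix whose columns sum to 1 (Eq. (2)).
    U = rng.random((C, n))
    U /= U.sum(axis=0, keepdims=True)
    for _ in range(max_iter):
        # Step 2, Eq. (3): cluster centers as membership-weighted means.
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)
        # Step 3, Eq. (4): memberships from distances to the new centers.
        d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)  # (C, n)
        d = np.fmax(d, np.finfo(float).eps)      # guard against zero distance
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0, keepdims=True)
        # Step 4: stop once the membership matrix stabilizes.
        converged = np.linalg.norm(U_new - U) <= eta
        U = U_new
        if converged:
            break
    # Step 5: harden the final membership matrix.
    return V, U, np.argmax(U, axis=0)
```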

3 Fuzzy C-means Clustering Algorithm Based on Local Density (LD-FCM)

3.1 Selection of Initial Cluster Center

The traditional FCM clustering algorithm is a local search algorithm with some shortcomings: it is sensitive to the initial cluster centers and tends to become trapped in local optima. Because FCM selects the initial cluster centers randomly, the clustering results are unstable and differ from run to run, so the clustering effect may not be the best every time, which limits the application of the algorithm. To solve these problems, this paper improves the selection of the initial cluster centers in the FCM algorithm and proposes a novel fuzzy C-means clustering algorithm based on local density (LD-FCM). The LD-FCM algorithm calculates the local density \( \rho_{i} \) of every sample point and, in each iteration, selects the sample point with the highest local density as an initial cluster center, using the distance matrix D and the distance threshold α. Selecting the initial cluster centers in this way ensures both the compactness of objects within the same cluster and the separation between cluster centers [25]. The specific improvements to the selection of the initial cluster centers are as follows:

Suppose the data sample set \( X = \left\{ {x_{1} ,x_{2} , \ldots ,x_{n} } \right\} \subset R^{s} \) is an s-dimensional data set in Euclidean space, where n is the number of samples and each \( x_{i} \) has s dimensions, expressed as \( x_{i} = \left\{ {x_{i}^{1} , \ldots ,x_{i}^{d} , \ldots ,x_{i}^{s} } \right\} \). The LD-FCM algorithm divides X into C classes (\( 2 \le C \le n \)) with C cluster centers \( V = \left\{ {v_{1} ,v_{2} , \ldots ,v_{c} } \right\} \).

  • Step 1: Calculate the distance between any two samples according to the Euclidean distance, Eq. (5), and generate the distance matrix D:

    $$ d_{ij} = \sqrt {\sum\nolimits_{d = 1}^{s} {\left( {x_{i}^{d} - x_{j}^{d} } \right)^{2} } } $$
    (5)
  • Step 2: Calculate the local density \( \rho_{i} \) of the data object \( x_{i} \) according to Eq. (6)

    $$ \rho_{i} = \sum\nolimits_{{x_{j} \in X}} {\chi \left( {d_{ij} - d_{c} } \right)} $$
    (6)
  • where \( d_{c} \) denotes the truncation distance and \( \chi \left( x \right) = \left\{ {\begin{array}{*{20}c} {1,\quad x \le 0} \\ {0, \quad x > 0} \\ \end{array} } \right. \). Eq. (6) counts the number of data points whose distance to the i-th data point is no more than the truncation distance \( d_{c} \), and uses this count as the local density of the i-th data point.

  • Step 3: Sort the local densities of all sample points in descending order, \( \rho_{i} > \rho_{j} > \rho_{k} > \cdots > \rho_{n} \), and take the sample point with the highest local density \( \rho_{i} \) as the first cluster center \( v_{1} \).

  • Step 4: Select the distance threshold α, then use the distance matrix D to find all samples whose distance from the first cluster center \( v_{1} \) is greater than α, and select the sample point with the highest local density among these samples as the second cluster center \( v_{2} \).

  • Step 5: Similarly, among the remaining samples, find all samples whose distance from every previously found center is greater than α, and select the sample point with the highest local density among them as the third cluster center \( v_{3} \).

  • Step 6: Repeat Step 5 until C cluster centers are found; this yields the C initial cluster centers. A sketch of this procedure is given below.
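The sketch below is an illustrative implementation of Steps 1–6 under our own naming; the truncation distance `d_c` and the threshold `alpha` must be supplied by the user:

```python
import numpy as np

def select_initial_centers(X, C, d_c, alpha):
    """Sketch of Sect. 3.1: choose C initial centers by local density."""
    # Step 1, Eq. (5): pairwise Euclidean distance matrix D.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Step 2, Eq. (6): local density = number of points within the truncation
    # distance d_c (the point itself is counted, as Eq. (6) states).
    rho = (D <= d_c).sum(axis=1)
    # Step 3: the densest point becomes the first cluster center.
    centers = [int(np.argmax(rho))]
    # Steps 4-6: repeatedly take the densest point that lies farther
    # than alpha from every center chosen so far.
    while len(centers) < C:
        far_enough = np.all(D[:, centers] > alpha, axis=1)
        if not far_enough.any():
            raise ValueError("alpha too large: no candidate is far enough "
                             "from all chosen centers")
        candidates = np.where(far_enough)[0]
        centers.append(int(candidates[np.argmax(rho[candidates])]))
    return X[centers], centers
```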

3.2 The LD-FCM Algorithm

The LD-FCM clustering algorithm has two stages. In the first stage, the local-maximum-density method and the distance threshold α are used to select the initial cluster centers. In the second stage, the FCM algorithm is run with the initial cluster centers obtained in the first stage. The specific steps of the LD-FCM algorithm are as follows.

  • Step 1: Calculate the distance between any two samples according to the Euclidean distance, Eq. (5), and generate the distance matrix D;

  • Step 2: Calculate the local density \( \rho_{i} \) of each data object \( x_{i} \) according to Eq. (6). Sort the local densities in descending order, \( \rho_{i} > \rho_{j} > \rho_{k} > \cdots > \rho_{n} \), and take the sample point with the highest local density \( \rho_{i} \) as the first cluster center \( v_{1} \);

  • Step 3: Select the distance threshold α (\( \alpha > 0 \)), then use the distance matrix D to find all samples whose distance from the first cluster center \( v_{1} \) is greater than α, and select the sample point with the highest local density among these samples as the second cluster center \( v_{2} \);

  • Step 4: Similarly, among the remaining samples, find all samples whose distance from every previously found center is greater than α, and select the sample point with the highest local density among them as the third cluster center \( v_{3} \);

  • Step 5: Repeat Step 4 until C cluster centers are found; this yields the C initial cluster centers \( v_{i} \left( {i = 1,2, \ldots ,C} \right) \);

  • Step 6: Set the iteration counter \( \gamma = 0 \), and take the result of Step 5 as the initial cluster centers \( V^{\left( \gamma \right)} = \left\{ {v_{1} ,v_{2} , \ldots ,v_{c} } \right\} \);

  • Step 7: The membership matrix \( U^{{\left( {\gamma + 1} \right)}} = (u_{ij} )_{c \times n} \) \( \left( {i = 1, \ldots ,c;\;j = 1, \ldots ,n} \right) \) is updated from the current cluster centers according to Eq. (4);

  • Step 8: The cluster center \( V^{{\left( {\gamma + 1} \right)}} = \left\{ {v_{1} ,v_{2} , \ldots ,v_{c} } \right\} \) is updated according to Eq. (3);

  • Step 9: Calculate \( e = \left\| {U^{{\left( {\gamma + 1} \right)}} - U^{\left( \gamma \right)} } \right\| \). If \( e \le \eta \) (\( \eta \) is the threshold, generally 0.001 to 0.01), the algorithm stops; otherwise, set \( \gamma = \gamma + 1 \) and go to Step 7.

  • Step 10: The samples are classified and output according to the final membership matrix U. If the sample \( x_{j} \) satisfies \( u_{ij} > u_{kj} \) for all \( k \ne i \), then \( x_{j} \) is classified into the i-th cluster, where \( u_{ij} \) represents the membership degree of the sample \( x_{j} \) to the cluster center \( v_{i} \). A sketch that strings the two stages together is given below.
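This hypothetical driver reuses the `select_initial_centers` function and the FCM updates sketched earlier (again an illustration under our own naming, not the authors' code):

```python
import numpy as np

def ld_fcm(X, C, d_c, alpha, m=2.0, eta=1e-3, max_iter=100):
    """LD-FCM sketch: density-based initialization followed by FCM updates."""
    V, _ = select_initial_centers(X, C, d_c, alpha)   # Steps 1-6
    U = None
    for _ in range(max_iter):
        # Step 7, Eq. (4): memberships from the current centers.
        d = np.fmax(np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2),
                    np.finfo(float).eps)
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0, keepdims=True)
        # Step 8, Eq. (3): centers from the updated memberships.
        Um = U_new ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)
        # Step 9: stop once the membership matrix has converged.
        if U is not None and np.linalg.norm(U_new - U) <= eta:
            U = U_new
            break
        U = U_new
    # Step 10: harden the final membership matrix.
    return V, U, np.argmax(U, axis=0)
```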

4 Results

To test the effect of the LD-FCM algorithm, we conducted experiments on several artificial datasets and real datasets from UCI. We also compared the LD-FCM algorithm proposed in this paper with the FCM algorithm, the K-means algorithm, the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm and the DP-FCM (Density peak-based FCM) algorithm. The K-means algorithm and the DBSCAN algorithm are classic representatives of partition-based and density-based clustering, respectively. In the K-means [26] algorithm, each cluster is represented by the mean of the objects in that cluster. However, being iterative, K-means can only obtain a local optimal solution. K-means works well for finding circular or spherical clusters in small and medium-sized data sets, but performs poorly on large-scale data sets or complex data types, so it needs to be extended [27,28,29]. The DBSCAN [30] algorithm derives maximal density-connected sets according to the density-reachability relation; each such set is a category or cluster in the final clustering. DBSCAN can divide sufficiently dense areas into clusters and find clusters of arbitrary shape in a spatial database with noise [31]. The DP-FCM algorithm was proposed by Liu et al. [19]: the DPC algorithm is first used to automatically select the centers and number of clusters, and the FCM algorithm is then used to perform the clustering.
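For reference, the K-means and DBSCAN baselines are available in scikit-learn; a minimal usage sketch is shown below (the hyperparameter values are placeholders, since the paper does not report the exact settings, and `X` is the (n, s) data array from the earlier sketches):

```python
from sklearn.cluster import KMeans, DBSCAN

# Placeholder hyperparameters, not the values used in the experiments.
kmeans_labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
```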

4.1 The Datasets and Evaluation Indexes of Experiment

The information of the experimental datasets is shown in Table 1 and Table 2: Table 1 describes the artificial datasets and Table 2 the real datasets from UCI [32]. These datasets differ in the number of attributes and the number of clusters.

Table 1. Description of the artificial datasets.
Table 2. Description of the real datasets in UCI.

The evaluation indexes of the experimental results are Accuracy, NMI (Normalized Mutual Information) and ARI (Adjusted Rand Index) [33, 34]. Accuracy is the number of correctly clustered samples divided by the total number of samples. NMI is an information measure from information theory with value range [0, 1]. ARI measures the agreement between two distributions of the data, with value range [−1, 1]. The larger the values of the three evaluation indexes, the better the clustering result. Their definitions are as follows.

$$ ACC = \frac{{\sum\nolimits_{i = 1}^{k} {a_{i} } }}{\left| U \right|} $$
(7)

where k is the number of clusters, \( a_{i} \) is the number of samples correctly classified into cluster \( C_{i} \), and \( \left| U \right| \) is the total number of samples.
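Evaluating Eq. (7) requires matching predicted clusters to true classes; one common way to do this (our choice for illustration, not necessarily the matching used by the authors) is the Hungarian algorithm:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Eq. (7) with the best one-to-one cluster-to-class matching."""
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    # Contingency table: table[i, j] = samples of class i placed in cluster j.
    table = np.array([[np.sum((y_true == c) & (y_pred == k)) for k in clusters]
                      for c in classes])
    # Hungarian algorithm on the negated table maximizes matched samples.
    row, col = linear_sum_assignment(-table)
    return table[row, col].sum() / len(y_true)
```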

$$ I\left( {X,Y} \right) = \sum\nolimits_{h = 1}^{{k^{\left( a \right)} }} {\sum\nolimits_{l = 1}^{{k^{\left( b \right)} }} {n_{h,l} \log \left( {\frac{{n \cdot n_{h,l} }}{{n_{h}^{\left( a \right)} n_{l}^{\left( b \right)} }}} \right)} } $$
(8)
$$ H\left( X \right) = \sum\nolimits_{h = 1}^{{k^{\left( a \right)} }} {n_{h}^{\left( a \right)} \log \frac{{n_{h}^{\left( a \right)} }}{n}} $$
(9)
$$ H\left( Y \right) = \sum\nolimits_{l = 1}^{{k^{\left( b \right)} }} {n_{l}^{\left( b \right)} \log \frac{{n_{l}^{\left( b \right)} }}{n}} $$
(10)
$$ NMI\left( {X,Y} \right) = \frac{{I\left( {X,Y} \right)}}{{\sqrt {H\left( X \right)H\left( Y \right)} }} $$
(11)

where X and Y are the random variables associated with the two partitions, \( I\left( {X,Y} \right) \) is their mutual information, and \( H\left( X \right) \) is the entropy of X. Here \( k^{\left( a \right)} \) and \( k^{\left( b \right)} \) are the numbers of clusters in the two partitions, \( n_{h,l} \) is the number of samples shared by cluster h of the first partition and cluster l of the second, and \( n_{h}^{\left( a \right)} \) and \( n_{l}^{\left( b \right)} \) are the corresponding cluster sizes.

$$ RI = \frac{a + b}{{C_{2}^{{n_{samples} }} }} $$
(12)
$$ ARI = \frac{{RI - E\left[ {RI} \right]}}{{max\left( {RI} \right) - E\left[ {RI} \right]}} $$
(13)

where C is the actual category information and K is the clustering result. a is the number of sample pairs assigned to the same category in both C and K, and b is the number of sample pairs assigned to different categories in both C and K. \( C_{2}^{{n_{samples} }} \) is the total number of sample pairs that can be formed from the dataset. RI is the Rand index, \( E\left[ \cdot \right] \) denotes expectation, and max() returns the maximum value.
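Both NMI and ARI are implemented in scikit-learn, which can be used to sanity-check Eqs. (8)–(13); note that `average_method='geometric'` reproduces the \( \sqrt {H\left( X \right)H\left( Y \right)} \) normalization of Eq. (11):

```python
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

# y_true: ground-truth labels; y_pred: cluster assignments (e.g. from ld_fcm).
nmi = normalized_mutual_info_score(y_true, y_pred, average_method='geometric')
ari = adjusted_rand_score(y_true, y_pred)
```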

4.2 Results on the Artificial Datasets

Figure 1(a)–(f) shows the clustering results of the LD-FCM algorithm on six different artificial datasets: Set, R15, Shape, Sizes, Twenty and Target. As Fig. 1 shows, the LD-FCM algorithm correctly clusters datasets with spherical or elliptical clusters, and the experimental results indicate that it is effective at finding clusters of varying shape, density, distribution and number. The LD-FCM algorithm overcomes the disadvantages of the original algorithm: it selects the initial cluster centers reasonably, then correctly computes the membership of each sample, and each clustering result is relatively stable.

Fig. 1. Clustering result graph.

4.3 Results on the Real Datasets in UCI

The experimental results on the real UCI datasets are shown in Table 3, Table 4 and Table 5, which report the Accuracy, NMI and ARI of each algorithm's clustering results, given as percentages. The numbers highlighted in bold in the three tables indicate the best performance for that evaluation index.

Table 3. Comparison of accuracy of clustering algorithms.
Table 4. Comparison of NMI of clustering algorithms.
Table 5. Comparison of ARI of clustering algorithms.

Based on the three evaluation indexes, the clustering results of the LD-FCM algorithm are generally better than those of the other four algorithms. The LD-FCM algorithm clusters the data sets effectively and produces improved clustering results. For datasets with many clusters and high dimensionality, the LD-FCM algorithm still clusters effectively. In addition, when a data set contains many noise samples and the boundaries between clusters are unclear, the LD-FCM algorithm is more suitable, which further demonstrates its effectiveness.

Table 6 shows the time performance of each algorithm on the different datasets. The experimental results show that the proposed algorithm inherits the time advantage of the FCM algorithm and runs faster than the other algorithms, so it has advantages in both running time and memory consumption.

Table 6. Comparison of running time of clustering algorithms (seconds/s).

5 Conclusion

This paper first introduced the concepts of fuzzy clustering and explained the principle of the fuzzy C-means clustering algorithm. Because the traditional FCM clustering algorithm selects its initial cluster centers randomly, its clustering results are unstable and the clustering effect may not be the best each time. To solve these problems, a novel fuzzy C-means clustering algorithm based on local density was proposed. The algorithm uses the local maximum density of sample points to improve the selection of the initial cluster centers, which reduces the number of iterations and avoids falling into local optima. Experiments on artificial datasets produced good clustering results, and real datasets from UCI were used to analyze the algorithm in terms of the three evaluation indexes Accuracy, NMI and ARI, as well as running time. The experimental results show that the LD-FCM algorithm outperforms the FCM, K-means, DBSCAN and DP-FCM algorithms and takes less running time than the other algorithms. It can therefore be concluded that the LD-FCM algorithm is effective and robust.