1 Introduction

Cluster analysis is an unsupervised learning method that aims to group a set of unlabeled objects into clusters, such that similar objects fall into the same cluster and dissimilar objects fall into different clusters [58]. Clustering helps people analyze data and solve practical problems, and thus it has been widely used in disparate fields, including statistical analysis [2, 3, 16, 23], pattern recognition [14, 42, 64], information retrieval [15, 26], and bioinformatics [13, 17]. Traditional clustering algorithms can be divided into several categories, such as partition methods [7, 24], hierarchical algorithms [50], density-based algorithms [4, 29], and graph-based methods [11].

Density-based algorithms, such as DBSCAN (density-based spatial clustering of applications with noise) [29], OPTICS (ordering points to identify the clustering structure) [4], and GDD (Gaussian Density Distance) [33], attempt to discover high-density clusters separated by low-density regions. Data points located in low-density regions are typically considered outliers. Removing the outliers before executing a clustering task can potentially improve the clustering performance.

By introducing the concept of k-reverse nearest neighbors (kRNNs), Vadapalli et al. [56] proposed the reverse nearest neighbor based clustering and outlier detection algorithm (RECORD). RECORD counts the number of k-reverse nearest neighbors of each data point and treats any point with fewer than k kRNNs as an outlier. After removing all outlier points, the remaining points are clustered. Similarly, many algorithms exploit kRNNs to further optimize the DBSCAN algorithm, such as IS-DBSCAN (enhancing density-based clustering algorithm) [10], ISB-DBSCAN (efficient and scalable density-based clustering algorithm) [46], and RNN-DBSCAN (density-based clustering algorithm using reverse nearest neighbor density estimates) [9]. These variants of DBSCAN use reverse nearest neighbor counts as the density of a data point. However, none of them considers how to build a clustering algorithm on an organic combination of multiple neighbor relations.

As one of the classical clustering approaches, hierarchical clustering produces a set of nested clusters organized as a hierarchical tree by investigating the similarity between subclusters. Hierarchical clustering algorithms can be divided into agglomerative and divisive algorithms. The advantage of hierarchical clustering is that it produces a hierarchical partition of the data at different levels of granularity. However, it cannot revise previous merging decisions, and the classical algorithm has a high computational complexity of O(n³). Recently, several variants of hierarchical clustering have been studied. Yu et al. proposed a minimum spanning tree (MST)-based agglomerative hierarchical clustering (TAHC) method [61] to detect clusters in artificial trees. Jeon et al. [38] proposed a new linkage method (NC-Link) for efficient hierarchical clustering of large-scale data. Bouguettaya et al. [8] presented a methodology (KnA) that combines agglomerative hierarchical clustering and partitional clustering to reduce the running cost. In addition, many hierarchical algorithms based on neighbor relations have been proposed. The Chameleon clustering algorithm [39] models the data points with a k-nearest-neighbor graph, which is then partitioned into a number of small subclusters. Lai et al. [31, 40] proposed an agglomerative clustering algorithm using a dynamic k-nearest-neighbor list to reduce the computational complexity of Ward's method. Ma et al. [48] proposed a three-stage MST-based hierarchical clustering algorithm, in which the neighbor relationship between points is modeled with a minimum spanning tree. The reciprocal-nearest-neighbors supported hierarchical clustering algorithm (RSC) [57] is based on the idea that two reciprocal nearest data points should be grouped into one cluster. Each of the above algorithms uses one specific neighbor relation; however, to our knowledge, how to organically combine various neighbor relations in hierarchical clustering has not been explored in the literature.

In cluster analysis, the similarity measure between data points or subclusters plays an important role in the clustering process [18]. Many clustering algorithms use distance-based measures of similarity, such as the Euclidean or cosine distance between data points and the average or maximum distance between subclusters. In recent years, neighbor relations [51, 60] have been applied to measure similarity. The shared nearest neighbor (SNN) algorithm [28, 37] developed a similarity measure between points based on the number of nearest neighbors they share. The reverse nearest neighbor based clustering (RECORD) algorithm [56] uses a reverse nearest neighborhood set to detect outliers. Sarfraz et al. [55] proposed a parameter-free clustering algorithm (FINCH) using the first neighbor relation. Qin et al. [51] proposed a clustering method based on hybrid k-nearest-neighbor (CHKNN) to find and merge high-density regions. The key idea of these similarity measures is to take a data point together with its neighbors as a whole and to evaluate the similarity of two points by their neighbors' similarities instead of the distance between the two points alone. Although the above algorithms propose various neighborhood-based similarity measures, most of them have difficulty processing complex data sets. For SNN, RECORD and CHKNN, it is difficult for a user to tune the parameters in realistic settings. Moreover, most neighborhood-based algorithms exploit at most one or two neighbor relations.

In this study we propose a three-stage hierarchical clustering algorithm, termed NTHC, based on nearest neighbor relations. Step 1: Outlier detection and removal. To alleviate the impact of outliers on the cluster structure, the reverse nearest neighbor method is used to detect and remove the outliers in the data set. Step 2: Preliminary clustering. The data points with stable connections on the 1-nearest neighbor graph are merged to form small clusters. Step 3: Subcluster merging. By introducing the concepts of linked representatives and extended representatives of two clusters, we propose a new measure of intercluster distance based on the representative points of two clusters. Based on this measure, we keep merging the closest pair of clusters until K clusters remain. According to extensive experiments on several synthetic and real data sets, the NTHC algorithm provides more accurate results than the state-of-the-art methods.

A valuable feature of the proposed method lies in its organic combination of various types of neighborhoods, which improves both the effectiveness and the efficiency of the algorithm. Compared to existing clustering methods, the proposed method can find clusters of different sizes, shapes and densities.

The main contributions of this paper are highlighted as follows.

  (1) Based on the idea of various types of neighborhoods, we define three concepts: the stability of a data point pair, the linked representatives, and the extended representatives.

  (2) We design a new measure of intercluster distance based on the linked representatives and extended representatives.

  (3) A clustering algorithm termed NTHC is developed based on the new measure of intercluster distance. The NTHC method can deal with clusters of different densities and shapes.

  (4) All values of the three parameters involved in the algorithm (k, thr, and ths) are fixed, so parameter tuning is not required.

The rest of the paper is outlined as follows. In Section 2, we discuss the outlier detection and neighborhood-based clustering algorithms. In Section 3, we present the proposed algorithm. In Section 4, experimental results are demonstrated to show the effectiveness of the proposed method. Some concluding remarks are given in Section 5.

2 Related work

2.1 Outlier detection

The concept of “outlier” was first proposed by Grubbs [43]: “An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs”. Although there is no strict definition for outliers, it is accepted by researchers that outliers exhibit “deviating characteristics”. Many researchers use such characteristics to improve clustering results by means of outlier detection.

In clustering research, outlier detection is commonly used in density-based clustering algorithms. Some clustering algorithms, such as DBSCAN and SNN, detect outliers based on the densities of data points. In the DBSCAN algorithm, the density of a data point xi is obtained by counting the number of data points within its ε-neighborhood. A point whose density falls below a specified threshold, MinPts, is considered an outlier and later discarded as noise. On the basis of DBSCAN, SNN redefines the density of a data point using shared nearest neighbors and further improves the DBSCAN algorithm. However, the two algorithms cannot work properly on clusters with varying densities.
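As a concrete illustration of this ε-neighborhood density criterion, the following minimal Python sketch flags low-density points in the way DBSCAN-style noise detection does; it is not the authors' code, and the eps and min_pts values in the usage line are arbitrary examples.

```python
import numpy as np

def eps_neighborhood_outliers(X, eps, min_pts):
    """Flag points whose eps-neighborhood contains fewer than min_pts
    other points, i.e. the DBSCAN-style noise criterion described above."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # N x N Euclidean distances
    density = (dist <= eps).sum(axis=1) - 1    # exclude the point itself
    return density < min_pts

# Usage (illustrative values): two dense blobs plus one far-away point
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 8, [[30.0, 30.0]]])
print(eps_neighborhood_outliers(X, eps=1.5, min_pts=4))
```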

The RECORD algorithm regards a point with fewer than k kRNNs as an outlier. Similar algorithms, such as IS-DBSCAN, ISB-DBSCAN, and RNN-DBSCAN, also use the kRNN concept to detect outliers. This kind of outlier detection depends on the reverse nearest neighbors and is more flexible in dealing with clusters of different densities.

Outliers lie far from the cluster centers, typically near the cluster boundaries. Because they can degrade the accuracy of the clustering results, removing them beforehand can improve the clustering performance. Figure 1(a) shows the outliers (denoted in red) identified by the proposed algorithm, which is discussed in detail in Section 3. Figure 1(b) shows the clusters after eliminating the outliers. As Fig. 1 illustrates, eliminating outliers is a necessary preprocessing step in clustering, which makes the data points within the same cluster more compactly distributed.

Fig. 1 Illustration of outlier detection and removal. (a) The outliers (denoted in red) identified by the proposed algorithm. (b) The clusters after outlier removal

2.2 Neighborhood-based clustering

Similarity measures based on shared nearest neighbors have been used to improve the performance of various types of clustering algorithms, including spectral clustering [21, 25], density peaks clustering [44, 47], k-means [20], and so on. For hierarchical clustering, a k-nearest-neighbor list has been incorporated to reduce the computational complexity of Ward's method. In addition, the idea of k nearest neighbors has been introduced into density peaks clustering for the local density computation [27]. Recently, the concept of reciprocal nearest neighbors has been investigated for the development of clustering algorithms; it has been used to identify dense regions [1], discover potential core clusters [54], distinguish the cluster membership of each point [30], and so on. Besides, to improve the performance of DBSCAN, reverse nearest neighbor methods such as RECORD, IS-DBSCAN, ISB-DBSCAN, and RNN-DBSCAN define the density of a point using reverse nearest neighbors. These methods require a single parameter k, the number of nearest neighbors.

The definition of a neighborhood can be categorized into two major types. The first counts the number of points in a hypersphere of radius r. The second is the k-nearest neighbors, which has a wide range of applications in classification, clustering and outlier detection [9, 12, 51, 57]. Its main idea is to find the k nearest neighbors of every point in the data set. Assume that XN × d = {x1, x2, …, xN} is the data set X with N data points and d dimensions. The distance matrix D of the data set X is an N × N matrix, where D(xi, xj) is the Euclidean distance between points xi and xj:

$$ D\left({x}_i,{x}_j\right)=\sqrt{\sum \limits_{m=1}^d{\left({x}_{im}-{x}_{jm}\right)}^2} $$
(1)

Starting with the distance matrix D, we can find the k nearest neighbors of xi, denoted by kNN (xi):

$$ kNN\left({x}_i\right)=\left\{{x}_j\left|D\right.\left({x}_i,{x}_j\right)\le D\left({x}_i,{NN}_k\left({x}_i\right)\right)\right\} $$
(2)

where NNk (xi) is the kth nearest neighbor of xi.
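A minimal NumPy sketch of Eqs. (1)-(2), assuming the data set is an N × d array and neighbor sets are returned as index arrays (ties broken by sort order):

```python
import numpy as np

def knn_sets(X, k):
    """Distance matrix D (Eq. 1) and, for each point, the indices of its
    k nearest neighbors kNN(x_i) (Eq. 2)."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))   # N x N Euclidean distance matrix
    order = np.argsort(D, axis=1)           # columns sorted by distance, self first
    knn = order[:, 1:k + 1]                 # drop column 0 (the point itself)
    return D, knn
```

The later sketches in this section reuse D and knn (and the derived first-neighbor array nn1 = knn[:, 0]) from this function.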

After determining the k nearest neighbors of each data point, the number of shared neighbors between two points can be used to define the similarity between them:

$$ similarity\ \left({x}_i,{x}_j\right)= size\left( kNN\left({x}_i\right)\cap kNN\left({x}_j\right)\right) $$
(3)

According to the shared nearest neighbor (SNN) clustering algorithm, the more neighbors two points share, the more similar they are: the similarity between a pair of points is defined in terms of how many nearest neighbors they have in common. Based on this definition of similarity, the SNN algorithm identifies core points and then builds clusters around them; all non-core points that are not within a radius Eps of any core point are discarded. Ye et al. [60] improved the SNN algorithm by considering both the number of shared nearest neighbors and the distance between data points.

In practice, the number of shared nearest neighbors is compared against a pre-specified threshold to evaluate the similarity between two points. The threshold value plays a vital role in this comparison: values that are too large or too small will degrade the final clustering result.
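Continuing the sketch above (using the knn index array from knn_sets), Eq. (3) and the threshold test just described can be written as:

```python
def snn_similarity(knn, i, j):
    """Shared-nearest-neighbor similarity of Eq. (3): the number of
    k-nearest neighbors that points i and j have in common."""
    return len(set(knn[i]) & set(knn[j]))

def snn_similar(knn, i, j, threshold):
    """Comparison against a pre-specified threshold, as discussed above;
    the choice of threshold is left to the user."""
    return snn_similarity(knn, i, j) >= threshold
```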

For data point xi, its k-reverse nearest neighbors can be defined as:

$$ kRNNs\left({x}_i\right)=\left\{{x}_j\left|D\right.\left({x}_i,{x}_j\right)\le D\left({x}_j,{NN}_k\left({x}_j\right)\right)\right\} $$
(4)

where NNk(xj) is the kth nearest neighbor of xj. Obviously, if xi is one of the k nearest neighbors of xj, then xj is one of the k-reverse nearest neighbors of xi, that is, xi ∈ kNN(xj) ⇔ xj ∈ kRNNs(xi).
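Since xi ∈ kNN(xj) ⇔ xj ∈ kRNNs(xi), the kRNN sets of Eq. (4), and the counts r(xi) used later in Eq. (6), can be obtained by simply inverting the kNN lists, as in this sketch:

```python
def krnn_sets(knn):
    """k-reverse nearest neighbors (Eq. 4), built by inverting the kNN lists:
    x_j is in kRNNs(x_i) exactly when x_i appears in kNN(x_j)."""
    n = len(knn)
    krnn = [[] for _ in range(n)]
    for j in range(n):
        for i in knn[j]:
            krnn[int(i)].append(j)
    return krnn

# r(x_i) = |kRNNs(x_i)|, the quantity later denoted r(x_i) in Eq. (6):
# r = [len(s) for s in krnn_sets(knn)]
```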

As discussed in [51], the reciprocal neighbor relation is defined as:

$$ {NN}_1\left({x}_i\right)={x}_j\wedge {NN}_1\left({x}_j\right)={x}_i $$
(5)

where NN1(xi) and NN1(xj) denote the first nearest neighbor of xi and xj, respectively.

Table 1 shows the main strengths and shortcomings of various neighborhood-based clustering algorithms. The table evaluates the methods using 11 properties: K-nearest neighbor, Shared nearest neighbors, K-reverse nearest neighbors, Reciprocal-nearest-neighbors, Detection of outliers, Number of adjustable parameters, Computational complexity, Varying densities, Irregular shapes, Varying sizes, and Parameter sensitivity. K-nearest neighbor, Shared nearest neighbors, K-reverse nearest neighbors and Reciprocal-nearest-neighbors indicate whether each type of neighbor structure is used in the algorithm ("√" denotes used and "×" denotes not used). Detection of outliers takes three values, able (√), unable (×), and partial (△), to show the outlier detection capability of the algorithm; △ means that the algorithm can detect outliers for only some data sets. Number of adjustable parameters gives the number of user-defined parameters in the algorithm. Computational complexity shows the amount of resources needed to perform the computation. Varying densities, Irregular shapes and Varying sizes refer to the ability of the algorithm to discover clusters of various densities, irregular shapes and various sizes, respectively; they also take the values able (√), unable (×), and partial (△). Parameter sensitivity indicates whether the algorithm is sensitive to its parameters.

Table 1 Comparison of clustering algorithms involving various neighbor relations. Note that N is the number of data points and k is the number of nearest neighbors; '×' means unable, '√' means able, and '△' means partial

From Table 1, we can see that several methods, such as SNN0, HC-MNN, Chameleon, SNN, and RECORD, are sensitive to their parameters. In contrast, NC, DAN and FINCH require no parameter tuning. All three of these methods use reciprocal nearest neighbors, but both NC and DAN require O(n³) computations. Four variants of DBSCAN (IS-DBSCAN, ISB-DBSCAN, RNN-DBSCAN, and ADBSCAN) can identify clusters of varying sizes and irregular shapes, and three of them use k-reverse nearest neighbors. Six methods (HC-MNN, Chameleon, DLA, DKNNA, MUNEC, and RSC) are hierarchical clustering algorithms. To reduce the computational complexity of traditional hierarchical clustering, various neighbor relations (k-reverse nearest neighbors, reciprocal nearest neighbors) are used in the DLA, DKNNA, and RSC methods. Although these methods significantly reduce the computational complexity of hierarchical clustering, some problems remain. For example, their performance degrades if there are outliers in the clusters. Furthermore, most of them cannot guarantee good results when the clusters have large variations in density. Therefore, the goal of this paper is to incorporate different neighbor relations into hierarchical clustering while guaranteeing efficiency and effectiveness.

3 The proposed algorithm

In this section, we present a detailed description of the NTHC algorithm. First, we define the stability of the data point pair and the intercluster distance based on the linked and extended representatives. Then, we propose the NTHC algorithm, which includes outlier detection and removal, preliminary clustering and subcluster merging. Finally, we present the analysis of the time complexities of the NTHC algorithm.

3.1 Definitions

The number of kRNNs for each point xi in the data set X can be represented by Eq. (6).

  • Definition 1 (the number of k-reverse nearest neighbors). For any point xi in data set X, the set of k-reverse nearest neighbors of point xi is kRNNs(xi); the number of kRNNs of point xi can be expressed as

$$ r\left({x}_i\right)=\mid kRNNs\left({x}_i\right)\mid $$
(6)

Then, we construct the nearest neighbor graph G1-NN as follows:

  • Definition 2 (1-Nearest Neighbor Graph). Let G1-NN be a 1-nearest neighbor graph in which xi and xj are connected by a directional edge, if xj is the first nearest neighbor NN1(xi) of xi.

There exist two cases with respect to the connection relation on G1-NN:

  (1) If the data points xi and xj are the first nearest neighbors of each other, the edge connecting them is bidirectional. This structure, composed of the vertices xi and xj and the corresponding bidirectional edge eij, is considered stable. Thus, xi and xj can be grouped into the same cluster.

Figure 2 illustrates an example of G1-NN, where the points that are the first nearest neighbors of each other can be merged into one cluster (marked with a red dashed circle).

  (2) If xj is the first nearest neighbor of xi, but xi is not the first nearest neighbor of xj, the edge (xi, xj) is a one-way edge. This one-way relation alone cannot guarantee the stability of the structure. As shown in Fig. 2, m is the first nearest neighbor of n, but n is not the first nearest neighbor of m. In this case, we further judge with the shared nearest neighbors whether this one-way relation is stable or not.

Fig. 2 An example of the nearest neighbor graph. Each point represents a data point. The one-way edge (m, n) represents that m is the first nearest neighbor of n

Based on the above discussion, we introduce the definition of the stability of the data point pair:

  • Definition 3 (Stability of the data point pair) The stability of the data point pair s(xi, xj) is defined as follows:

$$ s\left({x}_i,{x}_j\right)=\left\{\begin{array}{c}1, if\left({NN}_1\left({x}_i\right)={x}_j\wedge {NN}_1\left({x}_j\right)={x}_i\right)\vee \left( size\left( kNN\left({x}_i\right)\cap kNN\left({x}_j\right)\right)\ge {th}_s\cdot k\right)\\ {}0, else\end{array}\right. $$
(7)

where k is the value of k in k nearest neighbors, NN1(xi) and NN1(xj) denote the first nearest neighbor of xi and xj, respectively, and kNN(xi) and kNN(xj) denote the k nearest neighbors of xi and xj, respectively.
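Eq. (7) can be checked directly from the neighbor lists. In the sketch below, nn1 is assumed to be the array of first-nearest-neighbor indices (e.g. knn[:, 0] from the earlier sketch), and the default th_s = 0.65 is the value recommended in Section 4.5:

```python
def stability(i, j, nn1, knn, k, th_s=0.65):
    """Stability s(x_i, x_j) of Eq. (7): 1 if the two points are reciprocal
    first nearest neighbors, or share at least th_s * k of their k nearest
    neighbors; 0 otherwise."""
    reciprocal = (nn1[i] == j) and (nn1[j] == i)
    shared = len(set(knn[i]) & set(knn[j]))
    return 1 if (reciprocal or shared >= th_s * k) else 0
```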

Let Ci and Cj denote two clusters, respectively, and we define the linked representatives of Ci and Cj as follows:

  • Definition 4 (The set of linked representatives of Ci relative to Cj) The set of linked representatives of Ci relative to Cj can be defined as follows:

$$ {R}_{i\left|j\right.}=\left\{{x}_i\left|{x}_i\in {C}_i,\exists {x}_j\in {C}_j,{x}_i\in kNN\left({x}_j\right)\wedge {x}_j\in kNN\right.\left({x}_i\right)\right\} $$
(8)

According to Definition 4, the points of Ci and Cj that form mutual k-nearest-neighbor pairs across the two clusters are selected as the linked representatives of the respective clusters. As an example, in Fig. 3, the sets of data points marked with green circles and purple circles represent Ri|j and Rj|i, respectively. Since there may be only a few such mutual pairs, we further define the extended representatives to increase the number of representatives without losing accuracy.

  • Definition 5 (The set of extended representatives) Let \( {R}_{i\left|j\right.}^{\prime } \) denote the set of extended representatives of Ci relative to Ri|j, which is defined as follows:

Fig. 3 An example of the sets of representatives. The green circles and purple circles represent the linked representative points of Ci and Cj, respectively, and the red circles and orange circles represent the extended representative points of Ci and Cj, respectively

$$ {R}_{i\left|j\right.}^{\prime }=\left\{{x}_i\left|{x}_i\right.\in {C}_i\backslash {R}_{i\left|j\right.},\exists {x}_j\in {R}_{i\left|j\right.},{x}_i\in kNN\left({x}_j\right)\right\} $$
(9)

Definition 5 is less restrictive than Definition 4, in that it only requires a one-directional k-nearest-neighbor relation. As shown in Fig. 3, the sets of data points marked with red circles and orange circles represent \( {R}_{i\left|j\right.}^{\prime } \) and \( {R}_{j\left|i\right.}^{\prime } \), respectively.

Next, we define the intercluster distance based on representatives.

  • Definition 6 (Intercluster distance) The intercluster distance of Ci and Cj is defined as follows:

$$ d\left({C}_i,{C}_j\right)=\frac{\sum \limits_{x_i\in {R}_i}\sum \limits_{x_j\in {R}_j}D\left({x}_i,{x}_j\right)}{\left|{R}_i\right|\times \left|{R}_j\right|} $$
(10)

where \( {R}_i={R}_{i\left|j\right.}\cup {R}_{i\left|j\right.}^{\prime } \), \( {R}_j={R}_{j\left|i\right.}\cup {R}_{j\left|i\right.}^{\prime } \), and D(xi, xj) is the Euclidean distance between xi and xj.
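A minimal sketch of Definitions 4-6, assuming each cluster is a collection of point indices, knn[i] is the set of k-nearest-neighbor indices of xi, and D is the precomputed distance matrix; the fallback for the case where no linked pair exists is my own assumption, not part of the paper:

```python
import numpy as np

def representatives(Ci, Cj, knn):
    """Linked representatives R_{i|j} (Eq. 8) and extended representatives
    R'_{i|j} (Eq. 9) of cluster Ci relative to cluster Cj."""
    linked = {a for a in Ci
              if any((a in knn[b]) and (b in knn[a]) for b in Cj)}
    extended = {a for a in set(Ci) - linked
                if any(a in knn[b] for b in linked)}
    return linked, extended

def intercluster_distance(Ci, Cj, knn, D):
    """Intercluster distance d(Ci, Cj) of Eq. (10): mean pairwise distance
    between the representative sets R_i and R_j."""
    Ri = set().union(*representatives(Ci, Cj, knn))
    Rj = set().union(*representatives(Cj, Ci, knn))
    if not Ri or not Rj:
        Ri, Rj = set(Ci), set(Cj)   # assumption: fall back to all points
    return float(np.mean([D[a, b] for a in Ri for b in Rj]))
```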

3.2 Processes

The overall process is divided into three steps: outlier detection and removal, preliminary clustering, and subcluster merging. The detailed procedure is given in Algorithm 1 below.

Algorithm 1 The NTHC algorithm

Lines 1 ~ 2 of the NTHC algorithm correspond to the step of outlier detection and removal. All data points are sorted in ascending order of r(xi), and the first N × thr data points are regarded as outliers and eliminated from the data set X. Here, the threshold thr has a range of [0, 1). When thr is equal to 0, no data points are removed; when thr approaches 1, nearly all data points are treated as outliers. The goal of eliminating outliers is to ensure the accuracy of subsequent clustering; however, the cluster shapes will be distorted if too many points are taken as outliers. In this paper we set thr = 0.2, which is discussed in detail in Section 4.5.
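A sketch of this step, assuming r is the list of kRNN counts r(xi) (e.g. from the kRNN sketch in Section 2.2) and using the paper's thr = 0.2; it returns the indices of the retained points, which is my choice of interface rather than the paper's:

```python
import numpy as np

def remove_outliers(r, th_r=0.2):
    """Lines 1-2 of NTHC as described above: sort points by r(x_i) in
    ascending order and discard the first N * th_r of them as outliers."""
    order = np.argsort(r)               # ascending by reverse-neighbor count
    n_out = int(len(r) * th_r)
    return sorted(order[n_out:].tolist())
```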

Lines 3 ~ 5 of the NTHC algorithm correspond to the step of preliminary clustering. According to Eq. (7), the edge (xi, xj) is considered a stable structure if s(xi, xj) is equal to 1. For all the edges in G1-NN, we retain the edges with stable structure and remove the remaining edges. Finally, the data points connected by the retained edges are merged into a set of subclusters.

For a one-way edge, we count the number of shared neighbors between the two points xi and xj according to Eq. (7). If the number of shared neighbors is greater than ths ∙ k, this one-way structure is considered stable and the points xi and xj can be merged into the same cluster. Here, k is the number of nearest neighbors and ths is a threshold in the range (0, 1). In this paper we set ths to a constant value to avoid parameter adjustment. The principles for setting ths are as follows. (1) The threshold ths should not be too small: if ths is too small, more data point pairs will be regarded as stable and merged into the same cluster, and a wrong merge will affect the subsequent cluster merging. (2) The threshold ths can safely be set to a relatively large value: if ths is too large, some data point pairs that should be merged into the same cluster will be misjudged as unstable and grouped into different clusters, but this kind of error can be corrected to a certain extent in the subsequent subcluster merging.
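A union-find sketch of the preliminary clustering step: only the stable edges of G1-NN (Eq. 7) are kept and the connected points are merged. It assumes keep is the list of retained point indices, nn1 and knn still use the original indexing, and edges pointing to removed outliers are simply skipped, which is one possible reading of the text above:

```python
def preliminary_clusters(keep, nn1, knn, k, th_s=0.65):
    """Lines 3-5 of NTHC: retain the stable edges of the 1-NN graph (Eq. 7)
    and merge the points they connect into subclusters via union-find."""
    parent = {i: i for i in keep}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    keep_set = set(keep)
    for i in keep:
        j = int(nn1[i])
        if j not in keep_set:               # edge points to a removed outlier
            continue
        reciprocal = int(nn1[j]) == i
        shared = len(set(knn[i]) & set(knn[j]))
        if reciprocal or shared >= th_s * k:  # stable edge (Eq. 7)
            parent[find(i)] = find(j)

    groups = {}
    for i in keep:
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```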

Lines 6 ~ 8 of the NTHC algorithm correspond to the step of subcluster merging. We successively merge the cluster pair with the minimum intercluster distance according to Definition 6 until the final clustering result is obtained.

For subcluster merging, a crucial problem is how to define the similarity between two subclusters. A flawed intercluster similarity measure leads to incorrect merges that affect the final clustering result. Traditional intercluster similarity measures include single linkage, complete linkage and average linkage. Single linkage and complete linkage each select a single pair of data points to represent the subclusters, which makes the result sensitive to those representatives. For average linkage, all points in the clusters participate in the distance calculation, so the result is sensitive to the shape and density of the clusters.

In the cciMST algorithm (a clustering algorithm based on minimum spanning tree and cluster centers) [45], the data points at the intersection of clusters are taken as representatives, and the average distance over all representative pairs is regarded as the intercluster distance. This method is effective to some extent, but its efficiency is low because it needs to calculate and sort the distances of all data point pairs. Inspired by this method, this paper takes the intercluster nearest neighbors as the representatives and measures the intercluster similarity by the pairwise distances between representatives according to Eq. (10). The intercluster nearest neighbors are mostly located at the intersection of clusters. In addition, the k-nearest neighbors of all the data points are calculated in the initialization stage. Therefore, we can avoid the sorting operation when calculating the intercluster distance, which further improves the efficiency of the algorithm.
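The merging loop itself is then a straightforward greedy procedure over the intercluster distance of Eq. (10). The sketch below reuses intercluster_distance() from the Section 3.1 sketch and recomputes all pairwise cluster distances in every iteration for clarity, which is less efficient than the bookkeeping implied by the O(N²) analysis in Section 3.3:

```python
def merge_subclusters(clusters, knn, D, K):
    """Lines 6-8 of NTHC: repeatedly merge the two clusters with the smallest
    intercluster distance (Eq. 10) until K clusters remain."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > K:
        best, pair = float("inf"), None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = intercluster_distance(clusters[a], clusters[b], knn, D)
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters
```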

3.3 Complexity analysis

The detailed analysis of Algorithm 1 is as follows:

Lines 1 ~ 2: The distance calculation of all data point pairs results in a complexity of O(N²). The complexity of computing the nearest neighbor list can be reduced to O(N log N) using index structures such as the R*-tree [6], K-D tree [12], and ball tree [22]. The proposed algorithm is designed upon neighbor relations involving k nearest neighbors, reverse nearest neighbors, the 1-nearest neighbor, and shared nearest neighbors. After constructing the k-nearest neighbor list, the other neighbor relations can be derived from it. The number of k-reverse nearest neighbors r(xi) can be obtained from the k-nearest neighbor list, which requires O(N). Sorting the data points according to the value of r(xi) needs O(N log N).

Lines 3 ~ 5: In the k-nearest neighbor list, the neighbors of a data point are sorted according to their distances from that point, so the first-nearest-neighbor information can be read directly from the list. Next, each edge in G1-NN is checked for stability according to Eq. (7). Since a bidirectional edge is always a stable structure, we only need to count the shared nearest neighbors of the two endpoints of each one-way edge. Therefore, the time complexity of preliminary clustering is about O(N).

Lines 6 ~ 8: Suppose that the number of clusters to be merged is M. The representatives of the cluster pairs can be obtained from the k-nearest neighbor list, and the intercluster distance of Eq. (10) reuses the pairwise distances already computed in the initialization stage. Since M and the number of representatives depend on the data, the approximate time complexity of subcluster merging is O(N²).

Thus, the overall time complexity of the proposed algorithm is O(N²).

4 Experiment

4.1 Experiment procedure

In this section, we evaluate the proposed algorithm on five synthetic data sets DS1-DS5 [11, 52, 62, 63] and eight real data sets: Glass, Zoo, Ecoli, Breast-Cancer, Segment, SCADI, SECOM, and CNAE9. The eight real data sets are taken from the UCI repository [5], where SCADI, SECOM and CNAE9 are high-dimensional data sets. The descriptions of these data sets are shown in Table 2 and Fig. 4. We employ the Friedman test [19] to assess the statistically significant differences among the sixteen clustering algorithms. Moreover, we further discuss the values of the three parameters involved in the proposed algorithm, k, thr, and ths. The parameters are tested on the last six data sets in Table 2: Soybean, Wine, WheatSeeds, DrivFace, Pendigits and SpamBase.

Table 2 Description of the five synthetic data sets and the fourteen real data sets
Fig. 4 Five synthetic data sets

NTHC is compared to the following fifteen clustering algorithms:

  1. k-means [34]
  2. single linkage (SL) [50]
  3. TAHC [61]
  4. NC-link [38]
  5. KnA [8]
  6. GDD [33]
  7. DPC [52]
  8. spectral clustering (SPC) [11]
  9. cciMST [45]
  10. Chameleon [39]
  11. FINCH [55]
  12. SNN [28]
  13. RNN-DBSCAN (RNN-DB) [9]
  14. RSC [57]
  15. CHKNN [51]

Among the compared methods, k-means is a partitional clustering algorithm. SL, TAHC, NC-LINK and KnA are hierarchical clustering algorithms. GDD and DPC are density-based clustering algorithms. SPC, cciMST and Chameleon are graph-based clustering algorithms. The remaining five algorithms, FINCH, SNN, RNN-DB, RSC and CHKNN, are neighbor-based algorithms: FINCH and CHKNN are k-nearest-neighbor based clustering algorithms, SNN and RNN-DB are variants of DBSCAN based on shared nearest neighbors and reverse nearest neighbors, respectively, and RSC is a reciprocal-nearest-neighbors based clustering algorithm. The source code of the proposed algorithm is available online (see footnote 1).

For k-means and SPC, we take the best clustering result out of 10 trial runs on each data set. For KnA, the parameter K in its partitional clustering process is set to 0.1N (N is the total number of data points). For DPC, the cutoff distance dc is 2%. The parameters of Chameleon are set as follows: k = 10 (the number of nearest neighbors), minSize = 2.5% and alpha = 2.0. For SNN, the parameters are set as follows: k = 50, Eps = 20 and MinPts = 34. For RSC, the parameter α is 1.5. For the three parameters P1, P2 and P3 in CHKNN, we choose the optimal values according to the suggestion of reference [51].

4.2 Clustering results on synthetic data sets

The data set DS1 is composed of four parallel clusters with different densities. DS2 contains one Gaussian-distributed cluster and two round clusters. The clusters in both DS1 and DS2 are well separated. DS3 is composed of one spherical cluster and three half-moon clusters. DS4 contains four spherical clusters. DS5 is composed of one non-spherical cluster and four spherical clusters. The clustering results on the five synthetic data sets (DS1-DS5) are shown in Figs. 5, 6, 7, 8, 9.

Fig. 5 Clustering result on DS1

Fig. 6 Clustering result on DS2

Fig. 7 Clustering result on DS3

Fig. 8 Clustering result on DS4

Fig. 9 Clustering result on DS5

As a partitional clustering algorithm, k-means identifies the four spherical clusters in DS4 properly, but it fails to provide the desired clustering results for the other four data sets, which contain nonspherical clusters. Some hierarchical clustering algorithms, like SL and KnA, perform well on DS1 and DS2. However, hierarchical clustering algorithms are sensitive to noise and outliers, so SL, TAHC and KnA tend to produce unsatisfactory results when partitioning data sets with noise. Of the two density-based clustering algorithms, GDD is superior to DPC; since DPC is sensitive to the cutoff distance, it may need to be run multiple times. Among the graph-based clustering algorithms, the clustering performance of cciMST is better than that of SPC and Chameleon. However, the time complexity of cciMST is much greater than O(N²).

Among the six neighbor-based clustering algorithms (FINCH, SNN, RNN-DB, RSC, CHKNN and NTHC), both FINCH and RNN-DB fail to detect the expected clusters on all five data sets. FINCH performs its merging process according to the 1-nearest neighbor graph, which may lead to wrong merges. RNN-DB defines the density of a data point based on its reverse nearest neighbors, and an inaccurate density estimate can affect the results. The four neighbor-based algorithms SNN, RNN-DB, RSC, and CHKNN do not require the number of clusters in advance, so the number of clusters they produce is likely not exactly equal to the actual cluster number. SNN requires three input parameters that may influence the results; it finds the proper clusters for DS4 but fails on the remaining four data sets. RSC introduces reciprocal nearest neighbors to merge data points and can automatically determine the number of clusters, but for some data sets such as DS2 and DS5 it cannot determine the desired number of clusters and performs relatively poorly. CHKNN cannot provide proper clustering results for the five data sets: it needs three parameters to build a hybrid k-nearest-neighbor graph that guides the process of identifying isolated clusters, so incorrect parameter values may affect the clustering results. NTHC finds the correct clusters for all five data sets.

4.3 Clustering results on real data sets

To evaluate the quality of the clustering results, we adopt four external clustering validation indices: Accuracy (AC), Precision (PR), Recall (RE) and F1-measure (F1) [18]. Tables 3-10 show the validity index results on the eight real data sets. For GDD, SNN, RNN-DB, RSC and CHKNN, the number of clusters they produce is likely not exactly equal to the actual cluster number, so some indices cannot be computed; such cases are marked with "-" in Tables 3, 4, 5, 6, 7, 8, 9, 10.

Table 3 The performance comparison of sixteen clustering algorithms on Glass
Table 4 The performance comparison of sixteen clustering algorithms on Zoo
Table 5 The performance comparison of sixteen clustering algorithms on Ecoli
Table 6 The performance comparison of sixteen clustering algorithms on Breast-Cancer
Table 7 The performance comparison of sixteen clustering algorithms on Segment
Table 8 The performance comparison of sixteen clustering algorithms on SCADI
Table 9 The performance comparison of sixteen clustering algorithms on SECOM
Table 10 The performance comparison of sixteen clustering algorithms on CNAE9

For the Glass data set, the performance of NTHC is the best. The clustering performances of NTHC on Zoo and Breast-Cancer data sets are superior to others. For the Ecoli data set, the AC, RE and F1 values of NTHC are higher than other algorithms. For the Segment data set, the F1 value of NTHC is higher than other algorithms. And for three high dimensional data sets (SCADI, SECOM and CNAE9), the performance of NTHC is also superior to other algorithms.

4.4 Friedman test on experimental results

In this section, we test the statistically significant differences among the sixteen clustering algorithms using the Friedman test. Given a clustering algorithms and b experiments (here we set a = 16, b = 13), the null hypothesis (H0) of the Friedman test is that all the a algorithms are equivalent. The a algorithms are ranked according to their F1 values in each experiment. Then the Friedman statistic is calculated as follows:

$$ {\chi}_F^2=\frac{12b}{a\left(a+1\right)}\sum \limits_{i=1}^a{\left({r}_i-\frac{a+1}{2}\right)}^2=34.2437 $$
(11)

where ri is the average rank value of the i-th algorithm. The average rank values of the sixteen clustering algorithms are shown in Table 11.

Table 11 The average rank values of the sixteen clustering algorithms

Next, the statistic FF is given as:

$$ {F}_F=\frac{\left(b-1\right){\chi}_F^2}{b\left(a-1\right)-{\chi}_F^2}=2.5562 $$
(12)

The statistic FF follows an F-distribution with (a-1) and (a-1)(b-1) degrees of freedom. The critical value of F(15,180) is 1.722 when α is equal to 0.05. Therefore, we reject the null hypothesis H0 and conclude that there are some significant differences among the sixteen algorithms.

Then, we further perform the Nemenyi test to determine which of the algorithms are statistically different. According to the Nemenyi test, the performance of two algorithms can be regarded as significantly different when their corresponding average rank values differ by at least the critical difference CD:

$$ CD={q}_{\alpha}\sqrt{\frac{a\left(a+1\right)}{6b}}=6.4052 $$
(13)

where qα is the critical value. For α = 0.05 and a = 16, qα = 3.426.
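The quoted statistics can be reproduced directly from the values given in the text (a = 16, b = 13, χ²F = 34.2437, qα = 3.426); the small gap between the computed CD (≈ 6.398) and the reported 6.4052 is presumably due to rounding of qα:

```python
def friedman_check(a=16, b=13, chi2=34.2437, q_alpha=3.426):
    """Recompute F_F (Eq. 12) and the Nemenyi critical difference CD (Eq. 13)
    from the values reported in the text."""
    f_f = (b - 1) * chi2 / (b * (a - 1) - chi2)        # ~2.5562
    cd = q_alpha * (a * (a + 1) / (6.0 * b)) ** 0.5    # ~6.398
    return f_f, cd

print(friedman_check())
```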

Figure 10 shows the results of the above tests. The vertical axis denotes the sixteen algorithms and the horizontal axis denotes their average rank values. Each algorithm corresponds to a horizontal line, whose length represents the critical difference CD, and the black dot on the line corresponds to the average rank value of the algorithm. As can be seen from Fig. 10, NTHC has the best average ranking, and its performance is clearly superior to that of TAHC, DPC and SPC.

Fig. 10 Graphical presentation of results

4.5 Parameter setting

In this section, we discuss the values of the three parameters involved in NTHC: k, thr, and ths. Figure 11 shows the F1 values on six data sets as each parameter is varied; when one parameter is varied, the other two are fixed. The fixed values of k, thr, and ths are 10, 0.2 and 0.65, respectively.

Fig. 11 The effect of the three parameters (k, thr, and ths) on the clustering results. (a) k vs F1, (b) thr vs F1, (c) ths vs F1

The parameter k is the number of nearest neighbors. Figure 11(a) shows the F1 values on the six data sets for k ranging from 5 to 70. For Wine, WheatSeeds, DrivFace, and SpamBase, the value of F1 varies smoothly as k increases from 15 to 70; for Pendigits, it fluctuates slightly within the same range. Note that because Soybean contains only 47 data points, the algorithm terminates when k increases to about 41, and its F1 value drops considerably beyond k ≈ 28 due to the small number of data points. For all six data sets, the F1 value reaches its maximum around k = 10. Thus, the parameter k is suggested to be set to 10.

The parameter thr is a proportion parameter to detect the outliers. We can see from Fig. 11(b) that the value of F1 fluctuates relatively greatly with the change of thr. The clustering performance is sensitive to the parameter thr. Our algorithm can achieve good performance on the six data sets when thr is set to 0.2. Hence, the thr value is set to 0.2 in this paper.

The parameter ths is a threshold to measure the similarity of two data points. As shown in Fig. 11(c), the performance of NTHC on six data sets tends to be relatively insensitive to the value of the parameter ths. According to Fig. 11(c), our algorithm achieves good clustering results on the six data sets when ths is set to 0.65. Therefore, the ths value is suggested to be set to 0.65 in this paper.

5 Conclusion

In this paper, a novel neighborhood-based hierarchical clustering algorithm, NTHC, is presented. It uses the reverse nearest neighbors to detect and remove the outliers in the data set. Guided by the 1-nearest neighbor graph, data points with stable connections are merged. We then propose an intercluster distance measure based on the linked representatives and extended representatives. The main innovation of this paper is the organic combination of various kinds of neighbor relations, upon which the improved hierarchical clustering algorithm is built.

We evaluate our algorithm on five synthetic data sets and eight real data sets and compare its performance with fifteen clustering algorithms. NTHC discovers the correct clusters for all five synthetic data sets, and its performance on most data sets is superior to that of the compared methods. Furthermore, we test the statistically significant differences among the sixteen clustering algorithms using the Friedman test. The average rank value of NTHC is 4.19, which is the best among all sixteen clustering algorithms.

However, our algorithm requires the number of clusters in advance. In the future work, we plan to further optimize our algorithm to automatically determine the number of clusters. In addition, we intend to apply the neighbor relation to other different types of clustering algorithms.