1 Introduction

Clustering is a data analysis technique that reveals the structure of data points and extracts useful patterns from them. It groups data points so that each group contains similar points that are dissimilar to the points in other groups [1]. Clustering observes the features of each point to predict the group to which it belongs.

Clustering algorithms have been used in a wide range of applications, including bioinformatics [2], image processing [3,4,5,6,7], pattern recognition [8,9,10], document categorization [11, 12], financial risk analysis [13], cancer data analysis [14, 15], search engines [16], academics [17], drug activity prediction [18], and many more.

Many clustering algorithms have been proposed in the literature, but no single algorithm fits all types of data sets. That is, an algorithm might perform better than other algorithms on a specific data set but not on another one. Algorithms might also perform differently for different data set sizes, dimensionalities, or densities. In addition, some algorithms are simpler and easier to understand than others [19].

One of the most popular clustering algorithms is the k-means algorithm, which was proposed over 60 years ago and is still widely used [20]. Its simple implementation and fast execution make it popular. However, k-means assumes a known number of clusters, a spherical distribution of points within each cluster, and similar cluster sizes or densities [21], which might not hold for all types of data sets. k-means might also fall into local optima and produce fixed-shape clusters depending on the initial assignment of the centroids. A popular variation of k-means is k-means++, which works like k-means but selects the initial centroids more carefully [22]. Other popular clustering algorithms include Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) [23], density-based spatial clustering of applications with noise (DBSCAN), hierarchical DBSCAN (HDBSCAN) [24], Expectation-Maximization (EM) [1], Minimum Spanning Tree (MST) clustering [25, 26], single linkage hierarchical clustering (HC-SL) [25, 27, 28], complete linkage hierarchical clustering (HC-CL) [28, 29], and nature-inspired algorithms such as the genetic algorithm (GA) [30], Particle Swarm Optimization (PSO) [31,32,33], and static and dynamic clustering based on the multi-verse optimizer (SCMVO and DCMVO) [34].

In this paper, we propose a new clustering algorithm aimed at clustering different types of data sets, including challenging ones from the literature that have different cluster densities, point distributions, and sizes. We also try to overcome the fixed-shape clustering observed in other clustering algorithms. The proposed algorithm is referred to as Nearest Point with Indexing Ratio (NPIR). It addresses the clustering problem using the nearest neighbor search technique: it finds the nearest neighbors of the points that are already clustered, based on the distances between them, and clusters those neighbors accordingly. The algorithm does not consider all the nearest points at once to cluster a single point but iteratively considers only one nearest point at a time through several operations on a distance vector. It also exposes two parameters, the Indexing Ratio (IR) and the number of iterations (i), which can be tuned to further refine the generated results. In general, NPIR differs from other clustering algorithms in that it tries to partition data with arbitrary shapes, non-spherical clusters, or clusters of different densities.

Twenty data sets are used in the experiments: Aggregation, Aniso, Appendicitis, Blobs, Circles, Diagnosis II, Flame, Glass, Iris, Iris 2D, Jain, Moons, Mouse, Pathbased, Seeds, Smiley, Varied, Vary density, WDBC, and Wine. These data sets have different numbers of points, features, and clusters. Some of them are considered challenging in the literature as they have complex shapes, including circular patterns and interleaving clusters. Others have visually separated parts but are computationally difficult to partition. The proposed NPIR gives enhanced results compared to other well-known clustering algorithms: it iteratively corrects wrongly clustered points, which produces better results and avoids terminating the cluster shaping process at an early stage.

The proposed NPIR is evaluated in terms of Homogeneity Score, Completeness Score, V-measure, Adjusted Mutual Information, and Adjusted Rand Index. Results are compared with k-means++, BIRCH, HDBSCAN, EM, MST, HC-SL, HC-CL, GA, PSO, and DCMVO. The experiments show that NPIR outperforms the other algorithms on the majority of the data sets and is promising for many applications. A mall customer segmentation application is also examined and analyzed.

The remainder of the paper is organized as follows: the next section presents the most common algorithms and recent work on clustering. Section 3 describes the proposed algorithm in detail. Section 4 discusses the experiments and results along with a sensitivity analysis of the NPIR parameters. The last section concludes the work.

2 Related work

Clustering is a common task in machine learning that can be used to find the characteristics of each cluster. Numerous previous works have discussed clustering techniques and applications.

k-means is one of the earliest algorithms recognized in several applications as a useful clustering algorithm. Since then, the importance of clustering techniques and their positive effect on real-life applications has been widely recognized. Clustering algorithms can be classified into partitioning algorithms, hierarchical algorithms, and density-based algorithms [1, 35, 36].

Partitioning algorithms are the simplest and most basic clustering algorithms. Popular algorithms in this category include k-means [20], k-means++ [22], and EM [1]. Algorithms of this category generally require less computational time than algorithms in other categories. They consider initial clustering results with iterative correction of wrongly clustered data points. They also have a random behavior, producing different clustering results for different executions, which increases the chance of finding correct clustering results and enhances their quality. However, they do not work well for non-spherical shapes [35], and they usually fall into local optima.

Hierarchical algorithms such as BIRCH [23] group data points into a hierarchy of clusters. The strength of such algorithms is that they can detect arbitrary and non-spherical shapes that cannot be detected by partitional algorithms. However, they are unable to correct wrongly clustered data points as they cannot undo a clustering step [1]. They also consume considerable time and space [36].

Density-based algorithms such as DBSCAN [37] and OPTICS [38] can also work well with non-spherical shapes, but they fail to detect clusters of different densities [35, 36]. DBSCAN also requires careful selection of its parameters [35]. Thus, HDBSCAN [24] was recently introduced to solve the parameter and border point problems by revising the DBSCAN and OPTICS algorithms and generating a refined version of the two that considers both density-based and hierarchical clustering.

Variations of the aforementioned algorithms are observed in the literature, and several recent approaches have been proposed. The work of [39] considers the MinMax k-means algorithm, a weighted variation of k-means that assigns weights to the clusters according to their variance. Site rate clustering with automatic selection of partitioning schemes using iterative k-means is presented in [40]. The authors in [41] proposed a variant method for finding initial centroids using an entropy-based farthest neighbor approach. An algorithm that works as a mask for the EM algorithm and is suitable for large, high-dimensional data sets is presented in [42]. Parallel implementations for large-scale data sets also exist in the literature [43,44,45,46].

Nature-inspired algorithms are also used for clustering, such as GA [30], PSO [31,32,33], MVO [47, 48], and GWO [49, 50]. Variations of these algorithms can be found in the literature: a genetic algorithm-based clustering was proposed by [51] for constrained networks. Automatic clustering that finds the right number of clusters using a genetic algorithm-based clustering algorithm can be found in the work of [52]. The traveling salesman problem is also considered by [53], which proposes a new initial population strategy for the genetic algorithm. Adaptive PSO based on clustering (APSO-C) can be found in the work of [54]. A combination of fuzzy clustering pre-processing and PSO for clustering can also be found in the work of [55]. The authors in [56] proposed a hybrid approach combining the k-means algorithm with Particle Swarm Optimization for clustering Arabic documents. In addition, the authors in [34] proposed a clustering algorithm using the multi-verse optimizer with two modes: static clustering with a predefined number of clusters (SCMVO) and dynamic clustering with automatic detection of the number of clusters (DCMVO).

Under this view, NPIR is introduced and discussed in this work. NPIR takes advantage of both partitional and density-based algorithms to detect data sets of arbitrary and non-spherical shapes while also using iterations to correct wrongly clustered data points. It is based on the nearest neighbor search technique and on the random, iterative behavior of partitional clustering algorithms to give quality clustering results. It does not depend on cluster centers to predict clusters, as partitional clustering algorithms do, but instead considers cluster density as density-based clustering algorithms do. However, it also does not require the user to define the density as in DBSCAN; rather, it automatically adjusts to different densities and sizes of clusters. NPIR is discussed in detail in the next section.

3 Nearest Point with Indexing Ratio (NPIR)

NPIR is a clustering algorithm that explores the characteristics of the data points to group similar data points into the same clusters and dissimilar data points into different clusters. It is based on the nearest neighbor search technique for finding the k-nearest neighbors of a certain point. The algorithm iterates to assign data points to the most suitable clusters, performing Election, Selection, and Assignment operations to do so. In this section, we discuss the algorithm details, operations, and pseudocode. A simple example is also presented to further illustrate the algorithm, and the complexity of the algorithm is discussed.

3.1 Preliminaries

The NPIR clustering algorithm is based on the nearest neighbor search technique, where the k-nearest neighbors of a certain point are considered for assignment to the cluster of that point. Searching for the k-nearest neighbors is a nearest neighbor search (NNS) problem, which can be defined as follows [57]:

Definition 1

Given a set of N points \(P = \{p_1, p_2 \ldots p_N\}\) in space, find the set of k-nearest neighbors of a certain point \(p_i\), where \(p_i \in \{p_1, p_2 \ldots p_N\}\).

NPIR does not only consider a single point as the nearest neighbor of a given point; it also considers different k-nearest neighbors (k-NN) at different iterations of the algorithm. Thus, k-NN can be defined as follows [36]:

Definition 2

Given a set of N points \(P = \{p_1, p_2 \ldots p_N\}\) in space, \({{\text {k-NN}}}(p_i) = \{nn_{1}, nn_{2} \ldots nn_{k}\}\) is the set of k-nearest neighbors of a certain point \(p_i\), where \(k = |{{\text {k-NN}}}(p_i)|\), \(k < N\), and each nearest neighbor \(nn_j \in \{nn_{1}, nn_{2} \ldots nn_{k}\}\) is a point of P.

Furthermore, the nearest neighbor is found using the distance between points. In this study, we use the Euclidean distance between point \(p_i\) and \(nn_j\), which is defined as follows [58, 59]:

$$dist(p_i,nn_j)=\sqrt{\sum _{r=1}^{d}(p_{ir}-nn_{jr})^{2}}$$
(1)

where d is the dimension or the number of features.

3.2 NPIR description and steps

The main idea of NPIR is to find the nearest neighbors of the points that are already clustered, based on the distances between them, and to cluster these neighbors accordingly. In general, NPIR refers to such a nearest neighbor of an already clustered point as a Nearest point. Thus, NPIR finds the Nearest point for an already assigned point and then assigns the Nearest point to the same cluster as the assigned point, subject to the conditions discussed in this section.

The algorithm takes as input a data set containing the points that need to be clustered, along with three parameters that need to be set before running the algorithm:

  • The number of clusters (k). The value of k can be any integer larger than 1.

  • The indexing ratio (IR). This parameter controls the amount of possible reassignment of points. That is, a higher IR value means that the assigned points have a higher chance of being reassigned, while a lower IR value means a lower chance of reassignment. The value of IR should be between 0 and 1.

  • The number of iterations (i). It determines the number of times the algorithm repeats. Therefore, the algorithm terminates when the maximum number of iterations i is reached. The value of i can be any integer larger than 0.

At a preparatory stage, the algorithm generates a distance vector for each point. The distance vector of a point contains all the other points in the data set, sorted in ascending order by their distance to that point. In addition, a pointer is defined for each distance vector, indicating the current index of the vector. At this stage, the pointer of every vector refers to the first index of the vector. In practice, a K-dimensional tree [60,61,62] can be used instead of the distance vectors for large data sets to find the nearest neighbors efficiently, which is discussed further at the end of this section. However, we use the distance vector data structure for small data sets and consider it in our discussion for simplicity.
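As a minimal sketch (not the authors' reference implementation), the preparatory stage described above could be implemented with NumPy as follows; the function name `build_distance_vectors` and the returned structures are illustrative assumptions.

```python
import numpy as np

def build_distance_vectors(X):
    """For every point, build the indices of all other points sorted by
    ascending Euclidean distance, plus a pointer initialized to index 0."""
    n = X.shape[0]
    # Pairwise Euclidean distances (n x n matrix).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    vectors = []
    for i in range(n):
        order = np.argsort(dist[i])      # nearest first
        order = order[order != i]        # exclude the point itself
        vectors.append(order)
    pointers = np.zeros(n, dtype=int)    # every pointer starts at the first index
    return dist, vectors, pointers
```

For the four points of Fig. 1, for instance, the vector returned for P1 would contain P2, P4, and P3 in that order.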

Furthermore, suppose we have 4 points in space with defined distances between them, as shown in Fig. 1. Each point has its own distance vector and a pointer at the first index of that vector. For example, the distance vector of point P1 consists of points P2, P4, and P3, sorted according to their distance from P1, with distance values of 3, 6, and 8, respectively. In addition, the current index of the distance vector of P1 is the index of P2, which is indicated by the pointer of that vector. The figure also shows the distance vectors and corresponding pointers for points P2, P3, and P4. The distance vectors and their pointers are used later to select the points that are considered for clustering.

Fig. 1 Data points and the distance vector of each point, based on the distance between each pair of points

The algorithm starts by randomly assigning an initial point to each cluster, resulting in k initial points. The initial points are used to assign other points to the clusters. The algorithm then iterates for a predefined number of iterations according to the i parameter discussed earlier. Within each iteration of i, inner iterations are executed to allow assignment of the unassigned points and reassignment of the already assigned points. In each inner iteration, one point is considered for assignment using the following operations, which are illustrated with an example of 28 points:

  • Election: The algorithm randomly elects one of the points that are already assigned to a cluster and names it the ‘Elected’, or shortly ‘E’, as illustrated in Fig. 2. Right after the assignment of the initial points, the election is performed among the initial points only; subsequent elections are performed among all already assigned points. The election operation is random and follows no predefined order. Furthermore, a point might be elected several times while other points are never elected. The randomness of this operation helps define different cluster shapes and enhances the diversity of the search over data points in space.

  • Selection: The point at the pointer of the distance vector of the Elected is selected and named the ‘Nearest’, or shortly ‘N’, and the pointer is incremented by one for a possible next selection. Figure 3 illustrates the selection process: Fig. 3a illustrates the selection of the Nearest and Fig. 3b illustrates incrementing the pointer.

  • Assignment: To assign the Nearest to the cluster of the Elected, a check must be made to decide whether the Nearest should be assigned to the same cluster as the Elected. The following cases are possible for the Nearest:

    • Case 1: The Nearest is not yet assigned to a cluster. In this case, the Nearest is directly assigned to the cluster of the Elected, and the Elected is marked as the ‘Assigner’, or shortly ‘A’, of the Nearest. The Assigner is therefore the point that was Elected and successfully assigned the point to its cluster (see Fig. 4a). In addition, the Nearest is added as a descendant of the Assigner. A tree data structure with a hash table is used to track preceding parents and descendants.

    • Case 2: The Nearest is already assigned to the same cluster as the Elected. Therefore, no reassignment is made (see Fig. 4b). As a complementary step, the Elected becomes the new Assigner of the Nearest if it is closer to the Nearest than its old Assigner.

    • Case 3: The Nearest is already assigned to a different cluster than the Elected. In this case, the algorithm checks whether the Nearest should move from its cluster to the cluster of the Elected by measuring the distance between the Nearest and its Assigner and the distance between the Nearest and the Elected, and comparing the two. The following sub-cases apply:

      • Case 3a: If the distance between the Nearest and the Assigner (d1) is less than or equal to the distance between the Nearest and the Elected (d2), then the Nearest does not move to the cluster of the Elected (see Fig. 4c).

      • Case 3b: If the distance between the Nearest and the Assigner (d1) is greater than the distance between the Nearest and the Elected (d2), the Nearest moves to the cluster of the Elected (see Fig. 4d) and the Elected is marked as the Assigner of the Nearest. This avoids terminating the cluster shaping process at an early stage. It is highly probable that the points on the boundaries will eventually be reassigned according to the corresponding elected point. Unlike partitional clustering algorithms such as k-means, which usually fall into local optima and get stuck in fixed cluster shapes, NPIR explores more shapes by reassigning points over the course of iterations and can handle arbitrarily shaped clusters.

Fig. 2 Election operation, which randomly elects one of the points that are already assigned to a cluster

Fig. 3 Selection operation. a Selection of P2 as the Nearest point of point P7, i.e., the point at the index indicated by the pointer of the distance vector of the Elected; b increment of the pointer of the distance vector of the Elected

Fig. 4 Assignment operation. a Case 1: an unassigned Nearest is assigned to the cluster of the Elected; b Case 2: no assignment of the Nearest if it is assigned to the same cluster as the Elected; c Case 3a: no assignment of the Nearest if it is assigned to a different cluster than the Elected and d1 \(<=\) d2; d Case 3b: reassignment of the Nearest to the cluster of the Elected if it is assigned to a different cluster than the Elected and d1 > d2

The inner iterations terminate when all points are assigned and the TotalIndex value is reached. The TotalIndex value is calculated using the following formula:

$$\begin{aligned} TotalIndex = Round(IR * n^2) \end{aligned}$$
(2)

where IR is the indexing ratio and n is the number of points. The TotalIndex value represents the amount of possible reassignment of points. \(n^2\) represents the total size of all the distance vectors, since each of the n points has a distance vector of n elements. Therefore, TotalIndex represents the total number of indices reached over all the points, which bounds the amount of possible reassignment since the IR value falls between 0 and 1. For instance, with n = 28 points and IR = 0.1, TotalIndex = Round(0.1 × 784) = 78. This allows for more possible reassignment of points, depending on the IR value, even after all points are already assigned.

Furthermore, when the inner iterations terminate, the pointers are reset so that they point to the first element of the distance vector of every point, which allows for more possible reassignment of points in the next iteration of i.

NPIR is described by the pseudocode given in Algorithm 1. As shown in the pseudocode, the algorithm accepts three parameters, k, IR, and i, along with the set of points to be clustered. The preparatory steps are performed in lines 2–4, which create the distance vectors and calculate the TotalIndex value. The algorithm starts in line 5 by selecting the initial points. The i iterations are performed in lines 6–22 and terminate when the value of i is reached. For each iteration of i, inner iterations that assign points to clusters are performed in lines 7–20 and terminate when all points are clustered and the TotalIndex value is reached. For each inner iteration, the election, selection, and assignment operations are performed in lines 8, 9–11, and 12–19, respectively. Cases 1, 2, and 3 of the assignment operation are handled by the if statement starting at line 12. Each iteration of i ends by resetting the pointers of all the distance vectors, as shown in line 21. The algorithm then returns the predicted assignment of each point.

Algorithm 1 NPIR pseudocode
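Since the pseudocode figure itself is not reproduced here, the following Python sketch mirrors the steps described above under stated assumptions: it reuses the `build_distance_vectors` helper sketched earlier, the function and variable names are illustrative, and the empty-cluster repair and descendant reassignment mentioned in the notes below are omitted for brevity. It is a sketch of the described behavior, not the authors' implementation.

```python
import random
import numpy as np

def npir(X, k, IR, iterations, seed=None):
    """Sketch of the NPIR main loop (Election, Selection, Assignment)."""
    rng = random.Random(seed)
    n = len(X)
    dist, vectors, pointers = build_distance_vectors(X)   # preparatory stage
    total_index = round(IR * n ** 2)                       # Eq. (2)
    labels = np.full(n, -1)                                # -1 = unassigned
    assigner = {}                                          # Nearest -> its Assigner
    for c, p in enumerate(rng.sample(range(n), k)):        # k random initial points
        labels[p] = c
    for _ in range(iterations):
        index_count = 0
        while (labels == -1).any() or index_count < total_index:
            elected = rng.choice(np.flatnonzero(labels != -1).tolist())   # Election
            if pointers[elected] >= len(vectors[elected]):
                continue                                   # this vector is exhausted
            nearest = vectors[elected][pointers[elected]]  # Selection
            pointers[elected] += 1
            index_count += 1
            if labels[nearest] == -1:                      # Assignment, case 1
                labels[nearest] = labels[elected]
                assigner[nearest] = elected
            elif labels[nearest] == labels[elected]:       # case 2
                old = assigner.get(nearest, elected)
                if dist[nearest, elected] < dist[nearest, old]:
                    assigner[nearest] = elected
            else:                                          # case 3
                old = assigner.get(nearest)
                if old is None or dist[nearest, old] > dist[nearest, elected]:
                    labels[nearest] = labels[elected]      # case 3b (descendants omitted)
                    assigner[nearest] = elected
        pointers[:] = 0                   # reset pointers for the next iteration of i
    return labels
```

A call such as `npir(X, k=3, IR=0.15, iterations=30)` would then return one predicted cluster label per point (the parameter values here are purely illustrative).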

The following notes apply to the algorithm:

  • Distance vector construction can be replaced by a K-dimensional tree data structure [60, 61] for faster execution. A K-dimensional tree is a binary search tree used in nearest neighbor search tasks to organize data points in K-dimensional space with the aim of finding the nearest point to some other point as quickly as possible. It can be used in NPIR for large data sets to speed up the construction of the nearest neighbors of every data point in space. Thus, we can construct the K-dimensional tree at the preparatory stage and use it in later stages for finding the Nearest of an Elected in the Selection operation (a usage sketch is given after these notes). The complexity analysis of both data structures is further explained in Sect. 3.4.

  • When an assigned point is selected as the Nearest by an Elected point and is reassigned to the cluster of the Elected, all the descendants of the Nearest point are also reassigned to the cluster of the Elected. Descendants are the points that have the Nearest, or any other descendant of the Nearest, as their Assigner.

  • If the IR value is zero, which results in a TotalIndex value of zero, then each inner iteration terminates as soon as all points are assigned. That is, the algorithm executes in its simplest form, where an iteration terminates once all points are assigned. Conversely, if the IR value is 1, which is its maximum value, then every point is eventually selected as a Nearest by every other point marked as Elected. In other words, the pointers of all the points reach the last index of their distance vectors in every iteration of i. This adds unnecessary complexity, with more elections of points and more selections of far points that lead to no reassignments, and is therefore not recommended.

  • Upon moving the Nearest to the cluster of the Elected, if the cluster of the Nearest contains only the Nearest and its descendants, then a new random point is selected and assigned to the cluster of the Nearest so that the cluster is not left empty and the k value is preserved. The random point is selected from the pool of points that are not yet assigned. However, if all the points are already assigned, then the random point is selected from the pool of already assigned points and is assigned to the cluster of the Nearest along with its descendants.

  • An Elected point might be elected more than once, allowing it to select a different Nearest each time, which makes the detection of dense clusters more likely and enables more reassignments.

  • If case 3 applies to the Nearest but the Nearest does not have an Assigner, which is the case if it was selected as an initial point, then the decision is made in favor of case 3b and the Nearest is reassigned to the cluster of the Elected. This reduces the number of null Assigners and the dependency on the random selection of the initial points, which helps avoid fixed-shape clustering in early iterations.
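As a hedged illustration of the K-dimensional tree alternative mentioned in the first note, SciPy's `cKDTree` can replace the explicit distance vectors; the idea of emulating the pointer by querying one more neighbor each time the same point is elected is an assumption of this sketch, not part of the original description.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.random((10000, 2))            # a large two-dimensional data set
tree = cKDTree(X)                     # built once at the preparatory stage

# Emulating the pointer of an Elected point: if `visited` neighbors of this
# point have already been selected, query one more and take the last result.
elected, visited = 42, 3
dists, idx = tree.query(X[elected], k=visited + 2)   # +1 because the point itself is returned
nearest = idx[-1]                     # the next unvisited nearest neighbor
```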

3.3 A simple example

To further illustrate the algorithm, a simple example of nine points is discussed in this section, as shown in Figs. 5 and 6. For simplicity, the example uses a value of 1 for the parameter i, which means that only one iteration is performed.

Fig. 5 A simple example of nine points in two-dimensional space and the distance vectors of all the points

Fig. 6 NPIR applied to the simple example. a Assignment of the initial points; b assignment of point 3 based on case 1; c assignment of point 5 based on case 1; d assignment of point 1 based on case 1; e reassignment of point 1 based on case 3b; f assignment of point 7 based on case 1; g reassignment of points 2 and 1 and initial assignment of point 9 based on case 3b; h no assignment of point 3 based on case 2; i assignment of point 8 based on case 1; j no reassignment of point 8 based on case 3a

At the preparatory stage, the algorithm calculates the distance between every two points, which generates nine vectors, each related to a single point, as shown in Fig. 5. For example, points 2, 3, 4, 8, 9, 6, 5, and 7 are sorted according to their distance to point 1 in the first vector, with point 2 being the nearest and point 7 the farthest. Then the initial points are randomly set for each cluster, as shown in Fig. 6a, and the pointers are initially set to the first element of the vectors. The remaining operations are then performed for each sub-figure of Fig. 6 as follows:

  • Figure 6b illustrates the first election of a random point from the set of assigned points, which is point 4. According to the distance vector of point 4, the Nearest is selected, which is point 3. As a result, the assignment operation is performed and point 3 is assigned to the cluster of point 4 according to case 1 of the algorithm. Point 4 becomes the Assigner of point 3, and point 3 becomes a descendant of point 4 accordingly. In addition, the pointer of the distance vector of point 4 is incremented by 1 to allow, in a possible later election of point 4, the selection of the next Nearest, which is point 1.

  • Figures 6c, d illustrate the next elections, selections, and assignments, where point 5 and point 1 are assigned to the clusters of point 6 and point 4, respectively, according to case 1 of the algorithm. It is noted in Fig. 6d that the randomness of the election operation, in which point 4 is elected again, makes different clustering alternatives possible and defines different shapes for clustering.

  • Figure 6e illustrates the next election of a random point from the set of assigned points, which is point 2. According to the distance vector of point 2, the nearest point is selected, which is point 1. As a result, the assignment operation is performed and point 1 is reassigned to the cluster of point 2 according to case 3b, because the distance d1 between the Assigner (point 4) and the Nearest (point 1) is greater than the distance d2 between the Elected (point 2) and the Nearest (point 1). In this case, point 2 becomes the new Assigner of point 1. In addition, the pointer of the distance vector of point 2 is incremented by 1 to allow, in a possible later election of point 2, the selection of the next Nearest, which is point 3.

  • Figure 6f illustrates the next election of a random point from the set of assigned points, which is point 6. According to the distance vector of point 6, the nearest point is selected, which is point 7, and the pointer of the distance vector of point 6 is incremented by 1. As a result, the assignment operation is performed and point 7 is assigned to the cluster of point 6 according to case 1 of the algorithm.

  • Figure 6g illustrates the next election of a random point from the set of assigned points, which is point 4. According to the distance vector of point 4, the nearest point is selected, which is point 2, and the pointer of the distance vector of point 4 is incremented by 1. As a result, the assignment operation is performed and point 2 is reassigned to the cluster of point 4 according to case 3b: the Nearest does not have an Assigner, so the decision is made in favor of case 3b and the Nearest is reassigned to the cluster of the Elected along with its descendants, namely point 1 (point 2 is the Assigner of point 1). In this case, point 4 becomes the new Assigner of point 2. Furthermore, a random point is selected from the pool of unassigned points, which is point 9, and is assigned to the now empty cluster, which preserves the k value of the clustering. This shows how the algorithm can avoid terminating the cluster shaping process at an early stage and can detect arbitrary shapes.

  • Figure 6h illustrates the next election of a random point from the set of assigned points, which is point 2. According to the distance vector of point 2, the nearest point is selected, which is point 3, and the pointer of the distance vector of point 2 is incremented by 1. As a result, the assignment operation is performed and no reassignment is made for point 3 because it is already assigned to the same cluster as point 2, according to case 2 of the algorithm. Furthermore, point 2 becomes the new Assigner of point 3 because it is closer to point 3 than point 4, the old Assigner.

  • Figure 6i illustrates the next election, selection, and assignment, where point 8 is assigned to the cluster of point 9 according to case 1 of the algorithm. At this stage, all points are successfully assigned.

  • Figure 6j illustrates a continued step taken if the TotalIndex value has not yet been reached. In this case, the next election of a random point from the set of assigned points is performed, which is point 4. According to the distance vector of point 4, the nearest point is selected, which is point 8, and the pointer of the distance vector of point 4 is incremented by 1. As a result, the assignment operation is performed and no reassignment is made for point 8 because the distance d1 between the Assigner (point 9) and the Nearest (point 8) is less than the distance d2 between the Elected (point 4) and the Nearest (point 8), according to case 3a of the algorithm. In this case, point 9 remains the Assigner of point 8.

  • The algorithm then continues and reassignments are done until the TotalIndex value is reached.

As observed from this simple example, NPIR can detect data sets of arbitrary shapes with different cluster densities, point distributions, and sizes. Unlike other partitional clustering algorithms such as k-means, which get stuck in fixed spherical cluster shapes, NPIR explores more shapes over the course of iterations. It is highly probable that the points on the boundaries will eventually be reassigned according to the corresponding elected point, as described in case 3b.

Figure 7 illustrates how NPIR achieves different clustering results compared to k-means for arbitrarily shaped clusters after a course of iterations. It is observed from the figure that k-means gets trapped in fixed cluster shapes after a few iterations and is unable to detect arbitrary shapes because it is a centroid-based clustering algorithm; the red points indicate the centroids of the clusters. Some green and yellow points that are adjacent to the centroids were selected by k-means although they belong to different clusters. In each iteration, the assignments of the centroids change only slightly, resulting in fixed cluster shapes around the selected centroids. In contrast, NPIR can detect arbitrary shapes by avoiding the termination of the cluster shaping process at early stages and exploring more shapes over the course of iterations, thanks to the reassignment behavior of the points at the boundaries. Some green points of the clusters obtained in early iterations are closer to some yellow points in different clusters and can be reassigned in later iterations to the cluster of the yellow points through the reassignment of points. This results in exploring more shapes over the course of iterations and finding arbitrarily shaped clusters. Section 4 gives more examples of arbitrary shape detection by NPIR.

Fig. 7 Detecting arbitrary shaped clusters using NPIR vs. k-means

3.4 Complexity analysis

As mentioned earlier, two data structures can be used to construct the distances between data points and search for the nearest neighbors. In this section, we discuss the time and space complexity of the algorithm when implemented with distance vectors and with the K-dimensional tree [60, 61]:

Vectors: The time complexity of the algorithm using the distance vectors can be analyzed as follows:

  • The NPIR algorithm starts with the preparatory stage, which creates the distance vectors by calculating the distance between each point and all other points, taking into account the points' dimensionality. This takes O(\(n^2d\)), where n is the number of points and d is the number of dimensions. Sorting a distance vector takes O(\(n\log n\)).

  • The selection of the initial points takes O(k) where k is the number of clusters.

  • The Election operation takes O(\(\frac{in^2}{r}\)), where n is the number of points to assign, i is the number of iterations, and r is the reciprocal of the IR parameter. Since IR is recommended to take small values that should not exceed 0.2, as discussed in Sect. 4.3, the Election operation can be considered to take O(\(\frac{in^2}{5}\)) time.

  • The Selection operation of pointing to the Nearest point takes O(1) since the distance vector of the Elected is already sorted.

  • The Assignment operation is of constant time and can be considered as O(1).

In total, the overall complexity of the algorithm using distance vectors is O(\(n^2d + n\log n + k + \frac{in^2}{5}\)), which equals O(\(n^2d + \frac{in^2}{5}\)). The best-case complexity of the NPIR algorithm is O(\(n^2d + n\)) when i equals 1 and there are no reassignments to consider. In all cases, separating the complexity of the iterative behavior of NPIR, which is O(\(\frac{in^2}{5}\)), from the complexity due to the data set dimensionality and number of points, which is O(\(n^2d\)), can cause a recognizable drop in the running time of NPIR compared to other clustering algorithms in which these factors jointly affect the complexity.

On the other hand, the space complexity of the algorithm using distance vectors is O(\(n^2\)), for storing n distance vectors of size n and storing the parent–descendant relations in a tree of size n.

K-dimensional tree: The time complexity of the algorithm using K-dimensional tree can be analyzed as follows:

  • The NPIR algorithm starts with the preparatory stage, which creates the K-dimensional tree for all data points. This takes O(\(dn\log n\)), where n is the number of points and d is the dimension of the tree [60].

  • The selection of the initial points takes O(k) where k is the number of clusters.

  • The Election operation takes O(\(\frac{in^2}{r}\)), where n is the number of points to assign, i is the number of iterations, and r is the reciprocal of the IR parameter. Since IR is recommended to take small values that should not exceed 0.2, as discussed in Sect. 4.3, the Election operation can be considered to take O(\(\frac{in^2}{5}\)) time.

  • The Selection operation of searching for the Nearest point takes O(\(i\log n\)) in the average case and O(\(in^{1 - 1/d}\)) in the worst case [60], which cannot be larger than O(in), where n is the number of points, d is the dimension of the tree, and i is the number of iterations.

  • The Assignment operation is of constant time and can be considered as O(1).

In total, the overall complexity of the algorithm using the K-dimensional tree in the average case is O(\(dn\log n + k + \frac{in^2}{5} + i\log n\)), which equals O(\(dn\log n + \frac{in^2}{5}\)). The best-case complexity of the NPIR algorithm is O(\(dn\log n + k + n\)), which equals O(\(dn\log n\)), when i equals 1 and there are no reassignments to consider. In all cases, separating the complexity of the iterative behavior of NPIR, which is O(\(\frac{in^2}{5}\)), from the complexity due to the data set dimensionality and number of points, which is O(\(dn\log n\)), can cause a recognizable drop in the running time of NPIR.

On the other hand, the space complexity of the algorithm is O(n), for storing the data points in the K-dimensional tree and in the parent–descendant tree.

Considering the time complexity of both data structures, distance vectors can be used for small data sets where time and space complexity are not an issue. For large data sets, we recommend using the K-dimensional tree for faster execution and lower space consumption.

4 Experiments and results

This section presents the experiments conducted to investigate the efficiency of the proposed NPIR clustering algorithm. NPIR is evaluated on well-known challenging data sets that are widely used for performance tests of clustering algorithms. Furthermore, the performance of NPIR is compared to other well-known clustering algorithms, namely k-means++ [22], BIRCH [23], HDBSCAN [24], EM [1], MST [25, 26], HC-SL [25, 27, 28], HC-CL [28, 29], GA [30], PSO [31,32,33], and DCMVO [34]. In addition, NPIR is evaluated using different measures: Homogeneity Score (HS), Completeness Score (CS), V-measure (VM), Adjusted Mutual Information (AMI), and Adjusted Rand Index (ARI) [63, 64].

We used Python 3.7 to implement NPIR and the scikit-learn Python library [65] to compute the evaluation measures. We also used scikit-learn,\(^{1}\) the Python Package Index (PyPI),\(^{2}\) the Yarpiz packages,\(^{3}\) and MATLAB\(^{4}\) for the other algorithms in the comparison. All experiments were conducted on a personal computer with an Intel Core i7-5500U CPU @ 2.40 GHz and 8 GB of RAM.

This section presents the evaluation measures and the data sets used in the experiments. A sensitivity analysis of the algorithm parameters and a quantitative and qualitative analysis of the results are also discussed, and the computation time of the algorithm is provided. In addition, a mall customer segmentation case study is discussed and analyzed.

4.1 Evaluation measures

The evaluation of a clustering algorithm is not simply specified by the correctness of the predicted clusters against the true classes. Other considerations should be taken into account, such as the separateness and cohesion of the clusters, i.e., the similarity between data points in the same predicted cluster and the dissimilarity between data points in different clusters. In this regard, the following measures are considered for the evaluation (a scikit-learn usage sketch is given after the list):

Given a set S of N points, let C be a true partitioning assignment of S and K a predicted partitioning assignment of S.

  • Homogeneity Score: For all clusters, it measures whether all data points of a specific cluster belong to the same class [63].

    Given H (C) which is the classes entropy [63]:

    $$H(C) = - \sum _{c=1}^{|C|} \frac{n_c}{N} \cdot \log \left( \frac{n_c}{N}\right)$$
    (3)

    where \(n_c\) is the number of data points of class c.

    Furthermore, given the assignments of the clusters K, H (C|K) is the classes conditional entropy [63]:

    $$H(C|K) = - \sum _{k=1}^{|K|} \sum _{c=1}^{|C|}\frac{n_{kc}}{N} \cdot \log \left( \frac{n_{kc}}{n_k}\right)$$
    (4)

    where \(n_{kc}\) is the number of data points from class c assigned to cluster k and \(n_k\) is the number of data points of cluster k.

    The Homogeneity Score is then given by [63]:

    $$\text {HS} = 1 - \frac{H(C|K)}{H(C)}.$$
    (5)
  • Completeness Score: For all classes, it measures whether all data points of a specific class are assigned to the same cluster [63].

    Given H (K) which is the clusters entropy [63]:

    $$H(K) = - \sum _{k=1}^{|K|} \frac{n_k}{N} \cdot \log \left( \frac{n_k}{N}\right).$$
    (6)

    Furthermore, given the data points of the classes C, H (K|C) is the clusters conditional entropy [63]:

    $$H(K|C) = - \sum _{c=1}^{|C|} \sum _{k=1}^{|K|} \frac{n_{kc}}{N} \cdot \log \left( \frac{n_{kc}}{n_c}\right).$$
    (7)

    The Completeness Score is then given by [63]:

    $$\text {CS} = 1 - \frac{H(K|C)}{H(K)}.$$
    (8)
  • V-measure: V-measure measures the agreement between the true and predicted partitioning assignments. It is the harmonic mean of homogeneity and completeness, as defined by Rosenberg and Hirschberg [63]:

    $$VM = 2 \cdot \frac{HS \cdot CS}{HS + CS}.$$
    (9)
  • Adjusted Rand Index (ARI): ARI uses pair counting to measure the similarity between two assignments [64], namely the true and the predicted assignments. The Rand index is calculated using the following formula [66, 67]:

    $$\begin{aligned} \text {RI} = \frac{a + b}{\left( {\begin{array}{c}N\\ 2\end{array}}\right) } = \frac{\sum _{k,c}\left( {\begin{array}{c}n_{kc}\\ 2\end{array}}\right) }{\left( {\begin{array}{c}N\\ 2\end{array}}\right) } \end{aligned}$$
    (10)

    where a is the number of pairs of points that are in the same class in C and in the same cluster in K, b is the number of pairs of points that are in different classes in C and in different clusters in K, and \(\left( {\begin{array}{c}N\\ 2\end{array}}\right)\) is the total number of pairs in S.

    Furthermore, E[RI] and max[RI] are the expected and maximum RI, which are given by [67]:

    $$E[\text {RI}] = E\left[ \sum _{k,c}\left( {\begin{array}{c}n_{kc}\\ 2\end{array}}\right) \right] = \frac{\sum _{k=1}^{|K|}\left( {\begin{array}{c}n_k\\ 2\end{array}}\right) \sum _{c=1}^{|C|}\left( {\begin{array}{c}n_c\\ 2\end{array}}\right) }{\left( {\begin{array}{c}N\\ 2\end{array}}\right) }$$
    (11)
    $$\max [\text {RI}] = \frac{1}{2}\left[ \sum _{k=1}^{|K|}\left( {\begin{array}{c}n_k\\ 2\end{array}}\right) + \sum _{c=1}^{|C|}\left( {\begin{array}{c}n_c\\ 2\end{array}}\right) \right].$$
    (12)

    Correction for chance is applied to the Rand index, giving the Adjusted Rand Index [67]:

    $$\text {ARI} = \frac{\text {RI} - E[\text {RI}]}{\max (\text {RI}) - E[\text {RI}]}.$$
    (13)
  • Adjusted Mutual Information (AMI): AMI uses Shannon information theory to measure the similarity between two partitioning assignments [64], namely the true and the predicted assignments. The mutual information is calculated using the following formula [64]:

    $$\text {MI} = \sum _{k=1}^{|K|}\sum _{c=1}^{|C|}\frac{n_{kc}}{N}\log \left( \frac{\frac{n_{kc}}{N}}{\frac{n_k}{N} \cdot \frac{n_c}{N}}\right).$$
    (14)

    Furthermore, E[MI] is the expected MI which is given by [64, 68]:

    $$E[\text {MI}] = \sum _{k=1}^{|K|} \sum _{c=1}^{|C|} \sum _{n_{kc}=\max (0,\,n_k+n_c-N)}^{\min (n_k,\,n_c)} \frac{n_{kc}}{N} \log \left( \frac{N \cdot n_{kc}}{n_k n_c}\right) \times \frac{n_k!\,n_c!\,(N-n_k)!\,(N-n_c)!}{N!\,n_{kc}!\,(n_k-n_{kc})!\,(n_c-n_{kc})!\,(N-n_k-n_c+n_{kc})!}.$$
    (15)

    Given H(C) and H(K), the entropies of the partitioning assignments from formulas 3 and 6, correction for chance is applied to the mutual information, giving the Adjusted Mutual Information [64]:

    $$\text {AMI} = \frac{\text {MI} - E[\text {MI}]}{\max (H(K), H(C)) - E[\text {MI}]}.$$
    (16)
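All five measures are available in scikit-learn, which is also used for the evaluation in the experiments below; a minimal usage sketch for an illustrative true labelling `y_true` and predicted labelling `y_pred` follows.

```python
from sklearn import metrics

y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # true classes C (illustrative)
y_pred = [0, 0, 1, 1, 1, 1, 2, 2, 2]   # predicted clusters K (illustrative)

hs  = metrics.homogeneity_score(y_true, y_pred)            # Eq. (5)
cs  = metrics.completeness_score(y_true, y_pred)           # Eq. (8)
vm  = metrics.v_measure_score(y_true, y_pred)              # Eq. (9)
ami = metrics.adjusted_mutual_info_score(y_true, y_pred)   # Eq. (16)
ari = metrics.adjusted_rand_score(y_true, y_pred)          # Eq. (13)
```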

4.2 Data sets

Twenty data sets with different numbers of features, points, and clusters are selected, reflecting different levels of complexity. These data sets are gathered from scikit-learn,\(^{5}\) the UCI machine learning repository\(^{6}\) [69], the School of Computing at the University of Eastern Finland,\(^{7}\) ELKI,\(^{8}\) KEEL,\(^{9}\) and the Naftali Harris blog.\(^{10}\)

Table 1 shows the name and type of these data sets along with the number of clusters (k), features, and points. The values of the IR and i parameters used in the experiments are also shown for each data set. A brief description of each data set is given below; a generation sketch for the synthetic scikit-learn sets follows the list:

  • \(\mathbf Aggregation ^{\mathbf{7}}\): A two-dimensional data set consisting of seven clusters of different shapes and sizes, with some connections between them.

  • \(\mathbf Aniso ^{\mathbf{5}}\): A two-dimensional data set consisting of 3 unconnected isotropic Gaussian blobs with similar densities. They interleave at some points in the X and Y axes.

  • \(\mathbf Appendicitis ^{\mathbf{9}}\): Represents the medical status of 106 patients for having appendicitis based on 7 measures.

  • \(\mathbf Blobs ^{\mathbf{5}}\): A two-dimensional data set consisting of 3 unconnected isotropic Gaussian blobs with similar densities forming spherical shapes. It is usually used to demonstrate clustering, as the cluster centers and standard deviations are easily controlled. That is, a clustering algorithm should at least succeed in predicting such a data set.

  • \(\mathbf Circles ^{\mathbf{5}}\): A two-dimensional data set consisting of 2 clusters of different sizes forming 2 circles that are not connected to each other. One circle is surrounded by the other. The two circles need to be separated by a spherical boundary for binary clustering.

  • \(\mathbf Diagnosis II ^{\mathbf{6}}\): Represents the medical status of 120 patients for having a nephritis of renal pelvis origin based on 6 symptoms.

  • \(\mathbf Flame ^{\mathbf{7}}\): A two-dimensional data set consists of two connected clusters forming the shape of a flower.

  • \(\mathbf Glass ^{\mathbf{6}}\): Represents the types of glass for 214 items based on their chemical elements.

  • \(\mathbf Iris ^{\mathbf{6}}\): It consists of 4 characteristics of 150 iris plants to detect the type of each iris plant. The characteristics are the petal length, the petal width, the sepal length, and the sepal width in centimeters.

  • \(\mathbf Iris 2D ^{\mathbf{6}}\): A modification of the previous data set found in Weka tool [70] considering only two characteristics of the iris plants which are the petal length and width.

  • \(\mathbf Jain ^{\mathbf{7}}\): A two-dimensional data set consisting of two half circles of different densities that are not connected to each other but interleave in the X and Y axes for some points.

  • \(\mathbf Moons ^{\mathbf{5}}\): A two-dimensional data set consisting of 2 interleaving half circles with similar densities. The two half circles need to be separated by a spherical boundary for binary clustering. They are not connected to each other but interleave at some points in the X and Y axes.

  • \(\mathbf Mouse ^{\mathbf{8}}\): A two-dimensional data set consists of three connected clusters of different sizes that form the ears and face of Mickey Mouse which is a popular cartoon character.

  • \(\mathbf Pathbased ^{\mathbf{7}}\): A two-dimensional data set consisting of 3 unconnected clusters forming 2 spherical shapes and an incomplete circle. The two spherical shapes interleave with the incomplete circle in the X and Y axes.

  • \(\mathbf Seeds ^{\mathbf{6}}\): Represents 3 different varieties of wheat for 210 elements based on the geometric parameters of wheat kernels.

  • \(\mathbf Smiley ^{\mathbf{10}}\): A two-dimensional data set forming the outline of a face with a circular boundary, two eyes, and a mouth. The face boundary interleaves with the eyes and mouth in the X and Y axes.

  • \(\mathbf Varied ^{\mathbf{5}}\): A two-dimensional data set consisting of 3 isotropic Gaussian blobs connected to each other. The blobs have different densities and spherical shapes.

  • \(\mathbf Vary Density ^{\mathbf{8}}\): A two-dimensional data set consisting of 3 Gaussian blobs with variable densities.

  • \(\mathbf WDBC ^{\mathbf{6}}\): Presents characteristics of cell nuclei extracted from digitized images of 569 breast masses. Binary clustering is performed to detect whether a breast mass is malignant or benign.

  • \(\mathbf Wine ^{\mathbf{6}}\): Represents the quantities of 13 constituents for each instance to detect one of three types of wine.
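Several of the synthetic two-dimensional sets above (Blobs, Circles, Moons, Aniso, and Varied) originate from scikit-learn's generators; the following sketch shows how comparable sets can be produced, with the sample counts, noise levels, transformation matrix, and random seeds being assumptions rather than the exact values used in the experiments.

```python
import numpy as np
from sklearn import datasets

n = 1500  # illustrative sample count
blobs   = datasets.make_blobs(n_samples=n, centers=3, random_state=8)
circles = datasets.make_circles(n_samples=n, factor=0.5, noise=0.05)
moons   = datasets.make_moons(n_samples=n, noise=0.05)

# Aniso-like set: a linear transformation of Gaussian blobs.
X, y = datasets.make_blobs(n_samples=n, random_state=170)
aniso = (np.dot(X, [[0.6, -0.6], [-0.4, 0.8]]), y)

# Varied-like set: blobs with different standard deviations (different densities).
varied = datasets.make_blobs(n_samples=n, cluster_std=[1.0, 2.5, 0.5], random_state=170)
```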

Table 1 Data set properties: name, number of clusters, number of points, number of features, IR value, i value, and data set type

4.3 Sensitivity analysis

As stated earlier, the algorithm has two parameters, IR and i. IR controls the amount of possible reassignment of points, and i determines the number of times the algorithm repeats. The IR values 0, 0.01, 0.05, 0.1, 0.15, and 0.2 and the i values 1, 5, 10, 30, 50, and 100 are used in the experiments. The best IR and i values for the average results are selected for each data set, as shown in Table 1.

The effect of the IR parameter can be observed in Fig. 8. The figure shows two values of IR, 0.2 and 0.4, for a data set of 5 points. When IR equals 0.2, the TotalIndex value is 5 according to Eq. (2). This is observed in Fig. 8a, which shows that 5 different assignment and reassignment considerations take place, with 5 election, selection, and assignment operations per iteration of i. The red line in the figure shows the index reached in the distance vector of each point. For example, points 3 and 5 have each been elected twice, selecting points 5, 1, 3, and 4 as the Nearest, respectively. Point 1 has been elected only once, and points 2 and 4 have never been elected. In contrast, when IR equals 0.4, the TotalIndex value is 10 according to Eq. (2). Comparing Fig. 8a with b shows that a larger IR value allows more possible reassignments of points.

Since the IR parameter takes effect after all points are clustered, it allows more possible selections of different elected points and more possible reassignments of points, which might correct wrongly clustered points. This does not affect the correct assignments of points, as per case 3 of the Assignment operation discussed previously.

Fig. 8 IR effect on the election of 5 points for IR values of 0.2 and 0.4. The red line shows the index reached for each distance vector. a IR = 0.2 and TotalIndex = 5; b IR = 0.4 and TotalIndex = 10

On the other hand, the i parameter results in an iterative resetting of the pointers of all the distance vectors. This is also useful because it allows corrections of wrongly clustered points when the same point has been elected more times than required within a specific iteration of i. That is, when the pointers of all distance vectors are reset, more possible selections of different elected points are considered and reassignments of points are made to correct wrongly clustered points. This also does not affect the correct assignments of points, as per case 3 of the Assignment operation discussed previously.

The IR and i parameters also help avoid terminating the cluster shaping process at an early stage by exploring different cluster shapes across the iterations, since more possible selections of different elected points and more possible reassignments of points are considered. Thus, exploration of different clusters can be achieved, which helps in detecting arbitrarily shaped clusters. However, we do not recommend IR and i values larger than necessary, because they may result in unnecessary complexity with more elections of points and more selections of far points that lead to no reassignments. IR values beyond 0.2 and i values beyond 100 are not recommended for NPIR.

Since the algorithm is evaluated using multiple measures (HS, CS, VM, AMI, and ARI), the decision of selecting the best values of IR and i is a multi-objective problem, which can be solved using aggregating, population-based non-Pareto, and Pareto-based techniques [71]. We use the conventional weighted aggregation formula as the Evaluation (E) for the purpose of analyzing the sensitivity of the IR and i values, which is the most common approach for coping with multi-objective problems [72]. The weighted aggregation sums up all the objective measures into a weighted combination according to the following formula [72]:

$$E=\sum _{i=1}^{m}w_i f_i(x)$$
(17)

where m is the number of objective measures and \(w_i\) is a non-negative weight for each objective measure. The weights should sum to 1 (\(\sum _{i=1}^{m}w_i=1\)).

For simplicity, and for the purpose of the sensitivity analysis we are conducting, the Evaluation of each combination of IR and i is obtained by giving each measure an equal weight. This reduces to the average of the objective measures, as shown by the following formula [73]:

$$E=\sum _{i=1}^{m} \frac{1}{m} f_i(x).$$
(18)
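As a small illustration, with equal weights and the m = 5 measures used here, the Evaluation of Eq. (18) reduces to a simple mean; the function and variable names below are assumptions.

```python
def aggregate_evaluation(hs, cs, vm, ami, ari):
    """Equal-weight aggregation of the five measures, Eq. (18)."""
    measures = [hs, cs, vm, ami, ari]
    return sum(measures) / len(measures)

# e.g. aggregate_evaluation(0.9, 0.8, 0.85, 0.84, 0.83) -> 0.844
print(aggregate_evaluation(0.9, 0.8, 0.85, 0.84, 0.83))
```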
Fig. 9 Sensitivity analysis for the IR and i parameters for each data set for the average results using the aggregation formula

The sensitivity analysis for each data set for the average results using the aggregation formula of Eq. (18) is shown in Fig. 9. It is observed from the sensitivity graphs that NPIR performs differently for different values of IR and i for the majority of the data sets. In contrast, it performs approximately equally for all values of i for the Aggregation and Blobs data sets. The Circles, Glass, and Iris data sets need middle values of IR and i to give good results, while larger values of IR and i are needed for the Flame data set. Some data sets perform better for large values of IR, such as Diagnosis II and Jain, or large values of i, such as Aniso, Appendicitis, Moons, Pathbased, and Varied. Finally, the WDBC, Appendicitis, Flame, and Jain data sets show recognizably enhanced results for specific values of IR and i.

4.4 Results and discussion

To evaluate the proposed algorithm, the experiments are conducted on each data set for 30 independent runs. The best and average results are listed against those of the other well-known clustering algorithms, which are k-means++, BIRCH, HDBSCAN, EM, MST, HC-SL, HC-CL, GA, PSO, and DCMVO, using the evaluation measures listed earlier. For NPIR, the IR values 0, 0.01, 0.05, 0.1, 0.15, and 0.2 and the i values 1, 5, 10, 30, 50, and 100 are experimented with, as discussed in Sect. 4.3, and the best values among these are selected.

For the other clustering algorithms, a value of 100 is used as the maximum number of iterations for k-means++, BIRCH, and EM, which is the largest iteration value selected for NPIR. A value of 150 iterations and 50 search agents are used for DCMVO, as in the original paper presenting the algorithm [34]. Furthermore, a crossover probability of 0.8 and a mutation probability of 0.001 are used for GA. In addition, 50 generations of 20 chromosomes each are used for GA and PSO. The values selected for the parameters of GA and PSO are widely used in the literature for data sets of similar sizes to obtain satisfactory results [74,75,76,77,78]. Table 2 summarizes the parameter choices for these algorithms.

Table 2 Parameter settings

The results are analyzed from two aspects: quantitative and qualitative analysis.

4.4.1 Quantitative analysis

This section discusses the performance of NPIR against the other clustering algorithms, which are k-means++, BIRCH, HDBSCAN, EM, MST, HC-SL, HC-CL, GA, PSO, and DCMVO. Tables 3, 4, 5, 6 and 7 show the average results of NPIR against these algorithms for each data set in terms of HS, CS, VM, AMI, and ARI, respectively. The grand average results, calculated according to Eq. (18), are presented in Table 8. In addition, the best results of NPIR against the other algorithms are presented in Tables 9, 10, 11, 12 and 13, and the average of the best results is presented in Table 14. The average results are shown in the form avg(rank), which indicates the average and the ranking of each value over the 30 independent runs, respectively. Similarly, the best results are shown in the form best(rank), which indicates the best value and its ranking over the 30 independent runs. For each algorithm, a total ranking is calculated as the sum of all its ranks and is given in the last row of each table for each evaluation measure. A lower total ranking indicates a better algorithm, while a higher ranking indicates poorer clustering on the selected data sets. The performance results in these tables are discussed in more detail as follows:
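
For clarity, the total ranking can be reproduced as follows: for each data set, the algorithms are ranked by their score (rank 1 for the best), and the per-data-set ranks are then summed for each algorithm. The sketch below assumes a score matrix of shape (data sets x algorithms); the tie-handling method is an assumption, not something specified in the tables.

```python
import numpy as np
from scipy.stats import rankdata

def total_ranking(scores):
    """scores: array of shape (n_datasets, n_algorithms), higher is better.
    Returns the summed rank per algorithm (a lower total is better)."""
    # Rank within each data set; negate so the highest score receives rank 1.
    ranks = np.apply_along_axis(lambda row: rankdata(-row, method='min'), 1, scores)
    return ranks.sum(axis=0)
```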

In general, the average results for each evaluation measure show that NPIR outperforms the other algorithms for the majority of the data sets. In addition, NPIR has the minimum total ranking, which indicates that it performs better than the other algorithms on the selected data sets and that its ranking is satisfactory for the majority of them.

Furthermore, NPIR shows noticeably better performance than most of the other algorithms on the Aniso, Flame, Jain, Mouse, Smiley, and Vary Density data sets for several reasons. Specifically, the Vary Density data set can be considered hard for density-based clustering algorithms such as HDBSCAN because its clusters have different densities, which such algorithms cannot handle. Data sets with circular patterns, such as Circles and Smiley, cannot be handled properly by centroid-based algorithms because the centroid of a circular shape can be closer to other data points than to the points of the circular pattern itself. In addition, the Moons, Jain, Aniso, and Flame data sets contain points that interleave along the X and Y axes and are also hard for centroid-based algorithms because two points belonging to different classes can be at the same distance from a given centroid. In contrast, NPIR performs well on such data sets due to the random behavior of the election operation, the deterministic behavior of the selection operation when computing the similarity of the nearest neighbors, and the avoidance of fixed shapes over the course of iterations in the assignment operation.

In addition, the average results show that PSO has the closest total ranking to NPIR for VM, AMI, ARI, and the grand average, whereas DCMVO has the closest total ranking for CS and HDBSCAN the closest for HS. Since VM reflects both the homogeneity and the completeness of the clusters, i.e., both HS and CS, we cannot conclude that HDBSCAN or DCMVO compete with NPIR, because each of them satisfies either the homogeneity or the completeness of the prediction but not their combination, which is reflected by VM. On the other hand, MST and HC-SL perform very well on some data sets, such as Blobs, Circles, Diagnosis II, and Moons, but fail on most of the other data sets and rank poorly overall compared to the other algorithms. Thus, we can only conclude that MST and HC-SL compete with NPIR on some of the data sets, and that PSO competes with NPIR on the average results, since it has the closest performance for most of the measures on the selected data sets.

We can also observe that the performance of all clustering algorithms decreases for data sets with higher dimensionality, such as the Appendicitis, Glass, WDBC, and Wine data sets. In the case of NPIR, this can be explained by the adoption of the Euclidean distance for measuring the distances between points: it is well known that in high dimensions the Euclidean distance loses much of its discriminative power [79]. As future work, we plan to investigate the efficiency of adopting other distance measures, such as the Manhattan, cosine, Minkowski, and correlation distances, to overcome this limitation.
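
As a minimal sketch of this direction, the pairwise distance computation used in the preparatory stage could be parameterized by a metric name; the example below relies on SciPy's cdist, and the metric names shown are the SciPy identifiers for the measures mentioned above. The Minkowski order p is an assumed example value, not one used in our experiments.

```python
from scipy.spatial.distance import cdist

def distance_matrix(X, metric='euclidean'):
    """Pairwise distances between all points of X under a chosen metric.
    Supported examples: 'euclidean', 'cityblock' (Manhattan), 'cosine',
    'minkowski' (with order p), and 'correlation'."""
    if metric == 'minkowski':
        return cdist(X, X, metric='minkowski', p=3)  # p=3 is an illustrative choice
    return cdist(X, X, metric=metric)
```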

Table 3 Performance comparison of the average values of Homogeneity Score for different algorithms in the form of avg(rank) which indicates the average, ranking, and standard deviation of 30 independent runs, respectively
Table 4 Performance comparison of the average values of Completeness Score for different algorithms in the form of avg(rank) which indicates the average, ranking, and standard deviation of 30 independent runs, respectively
Table 5 Performance comparison of the average values of V-measure for different algorithms in the form of avg(rank) which indicates the average, ranking, and standard deviation of 30 independent runs, respectively
Table 6 Performance comparison of the average values of AMI for different algorithms in the form of avg(rank) which indicates the average, ranking, and standard deviation of 30 independent runs, respectively
Table 7 Performance comparison of the average values of ARI for different algorithms in the form of avg(rank) which indicates the average, ranking, and standard deviation of 30 independent runs, respectively
Table 8 Performance comparison of the average values of grand average score for different algorithms in the form of avg(rank) which indicates the average, ranking, and standard deviation of 30 independent runs, respectively
Table 9 Performance comparison of the best values of Homogeneity Score for different algorithms in the form of best(rank) which indicates the best and ranking of 30 independent runs
Table 10 Performance comparison of the best values of Completeness Score for different algorithms in the form of best(rank) which indicates the best and ranking of 30 independent runs
Table 11 Performance comparison of the best values of V-measure for different algorithms in the form of best(rank) which indicates the best and ranking of 30 independent runs
Table 12 Performance comparison of the best values of AMI for different algorithms in the form of best(rank) which indicates the best and ranking of 30 independent runs
Table 13 Performance comparison of the best values of ARI for different algorithms in the form of best(rank) which indicates the best and ranking of 30 independent runs
Table 14 Performance comparison of the best values of average of bests score for different algorithms in the form of best(rank) which indicates the best and ranking of 30 independent runs

On the other hand, the best results for each evaluation measure show that NPIR, MST, and HC-SL outperform the other algorithms, achieving a value of 1 for the Circles, Diagnosis II, and Moons data sets, while NPIR outperforms all other algorithms, achieving a value of 1 for Aniso, Jain, and Vary Density. This indicates correct clustering of all data points for these data sets. In addition, NPIR has a clearly lower total ranking of the best results for all evaluation measures, which indicates that it performs better than the other algorithms on the selected data sets and is promising for other data sets. Furthermore, GA has the closest total ranking of the best results for all measures, owing to its exploratory behavior. In contrast, HDBSCAN, BIRCH, and MST have the worst total rankings of the best results on these data sets.

A summary of the rankings is given in Tables 15 and 16, which show the total rank of the average and best results for each algorithm, respectively. As shown in the tables, NPIR has the minimum total rank for all evaluation measures compared to the other algorithms for both the average and the best results.

In addition, radar charts of the grand average and the average of bests scores are displayed in Figs. 10 and 11, respectively. The two figures summarize the clustering performance of all algorithms on each data set. Looking at the grand average results in Fig. 10, we can observe that NPIR performs better on the majority of the data sets, with the radar lines of the other algorithms enclosed by the NPIR radar line for most parts of the chart. In Fig. 11, NPIR performs better on almost all data sets for the average of bests results, with the radar lines of all other algorithms enclosed by the NPIR line. In the case of MST and HC-SL, only some data sets, as discussed earlier, have the same value as NPIR and thus have identical radar lines on those data sets. It is also observed that the NPIR radar line almost reaches the edges for most parts of the chart, which indicates that the algorithm attains the maximum value of 1 in the best results for many data sets, i.e., correct clustering of all data points.

Tables 17 and 18 show the average and best CPU time spent by NPIR compared to the other algorithms. As the results show, the computation time of NPIR is acceptable and fast enough to handle data sets of similar size. Compared to the other algorithms, NPIR is faster than DCMVO for most of the data sets but tends to be slower than the remaining algorithms. The main reason is the time taken for constructing the distance vectors in the preparatory stage and the frequent elections of points. Nevertheless, NPIR handles such data sets in reasonable time while producing high-quality results for the majority of them, as discussed earlier. It is also possible to use a Graphics Processing Unit (GPU) to decrease the processing time.

Table 15 Total rank for each algorithm based on the average results
Table 16 Total rank for each algorithm based on the best results
Fig. 10

Performance comparison for the average evaluation of grand average score of all the measures for different algorithms

Fig. 11

Performance comparison for the best evaluation of average of bests score of all the measures for different algorithms

Table 17 Average computation time (seconds)
Table 18 Best computation time (seconds)

4.4.2 Qualitative analysis

In this section, two-dimensional plots of the best results for selected data sets are presented. Each predicted cluster is colored differently from the other predicted clusters. The selected data sets are challenging, as they have non-spherical shapes, contain clusters of different densities and sizes, and have data points of different clusters that interleave along the X or Y axis. Figures 12, 13, 14, 15, 16, 17 and 18 visualize the clustering quality of the best run for each of the selected data sets.
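
The visual comparisons in these figures amount to scatter plots of the points colored by label. The sketch below uses scikit-learn's make_moons as a stand-in for the Moons data set and k-means as an example of a centroid-based algorithm, so it reproduces only the plotting approach, not the exact figures of the paper.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans

# Generate a Moons-like data set and cluster it with a centroid-based method.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)
y_pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
for ax, labels, title in zip(axes, [y_true, y_pred], ['True labels', 'k-means prediction']):
    ax.scatter(X[:, 0], X[:, 1], c=labels, s=10)  # one color per cluster
    ax.set_title(title)
plt.show()
```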

The Aggregation data set has 7 clusters of different sizes with some connections between them. The challenges of this data set are the varying cluster sizes and the connections between clusters. Figure 12 shows that some algorithms predict some of the clusters but fail to predict all of them correctly, whereas NPIR assigns almost all data instances to the correct clusters. In addition, EM performs similarly to NPIR, while DCMVO has the worst visual performance.

Fig. 12

Aggregation data set displayed in two-dimensional space for a True labels; b NPIR; c k-means++; d BIRCH; e HDBSCAN; f EM; g MST; h HC-SL; i HC-CL; j GA; k PSO; l DCMVO with a different color for each cluster (color figure online)

The Circles data set has 2 clusters of different sizes forming 2 circles that are not connected to each other. This data set is challenging because of its circular clusters and because one circle is contained within the other, which is very hard to predict for clustering algorithms, especially centroid-based ones, since both circles share the same centroid but differ in size. Figure 13 shows that NPIR, MST, and HC-SL assign the data instances to the correct clusters, while all the other algorithms fail to recognize the target clusters. This data set shows how poorly the other algorithms perform and that NPIR has a clear advantage over most of them.

Fig. 13

Circles data set displayed in two-dimensional space for a True labels; b NPIR; c k-means++; d BIRCH; e HDBSCAN; f EM; g MST; h HC-SL; i HC-CL; j GA; k PSO; l DCMVO with a different color for each cluster (color figure online)

The Jain data set has 2 clusters of different densities that are not connected to each other but whose points interleave along the X and Y axes, which makes it challenging for clustering algorithms. Figure 14 shows that NPIR assigns the data instances to the correct clusters, while all the other algorithms fail to recognize the target clusters for the interleaving points. This data set also shows how poorly the other algorithms perform and that NPIR has a clear advantage over all of them.

Fig. 14

Jain data set displayed in two-dimensional space for a True labels; b NPIR; c k-means++; d BIRCH; e HDBSCAN; f EM; g MST; h HC-SL; i HC-CL; j GA; k PSO; l DCMVO with a different color for each cluster (color figure online)

The Moons data set has 2 clusters of similar densities that are not connected to each other but whose points interleave along the X and Y axes, which makes it challenging for clustering algorithms. Figure 15 shows that NPIR, MST, and HC-SL assign the data instances to the correct clusters, while all the other algorithms fail to recognize the target clusters for the interleaving points. This data set also shows how poorly the other algorithms perform and that NPIR has a clear advantage over most of them.

Fig. 15

Moons data set displayed in two-dimensional space for a True labels; b NPIR; c k-means++; d BIRCH; e HDBSCAN; f EM; g MST; h HC-SL; i HC-CL; j GA; k PSO; l DCMVO with a different color for each cluster (color figure online)

The Smiley data set has 4 unconnected clusters forming the shape of a face, with clusters for the face boundary, the eyes, and the mouth. It is considered challenging because the face boundary interleaves along the X and Y axes with the eyes and mouth, and because the centroid of the circular face boundary is closer to the data points of the eye and mouth clusters than to the points of the circular face pattern itself. Figure 16 shows that all algorithms except NPIR, HDBSCAN, and HC-SL fail to predict the face boundary. BIRCH, EM, and PSO predict the correct clusters for the eyes and mouth but not for the face boundary, whereas NPIR and HDBSCAN predict the face boundary, the eyes, and the mouth correctly.

Fig. 16

Smiley data set displayed in two-dimensional space for a True labels; b NPIR; c k-means++; d BIRCH; e HDBSCAN; f EM; g MST; h HC-SL; i HC-CL; j GA; k PSO; l DCMVO with a different color for each cluster (color figure online)

The Varied data set has 3 clusters of different densities that are connected to each other. The challenges of this data set are the differing cluster densities and the connections between clusters. It is also considered hard for density-based clustering algorithms such as HDBSCAN, which cannot handle the density differences and therefore give very poor results, as observed. Figure 17 shows that most algorithms fail to predict the correct clusters, whereas NPIR predicts almost all of them correctly. In addition, EM has the closest performance to NPIR on this data set compared to the other algorithms. The remaining algorithms fail to distinguish the data points at the connections between clusters, resulting in poor clustering.

Fig. 17

Varied data set displayed in two-dimensional space for a True labels; b NPIR; c k-means++; d BIRCH; e HDBSCAN; f EM; g MST; h HC-SL; i HC-CL; j GA; k PSO; l DCMVO with a different color for each cluster (color figure online)

The Vary Density data set has 3 clusters of different densities and sizes that are connected to each other. The challenges of this data set are the differing cluster densities and sizes and the connections between clusters. It is also considered hard for density-based clustering algorithms such as HDBSCAN, which cannot handle the density differences and therefore give very poor results, as observed. Figure 18 shows that all the other algorithms fail to predict the correct clusters, whereas NPIR predicts them correctly. In addition, PSO, GA, and DCMVO have the closest performance to NPIR on this data set but still misassign the data points at the connections between clusters. The remaining algorithms cluster all points into one cluster, resulting in poor clustering.

Fig. 18

Vary Density data set displayed in two-dimensional space for a True labels; b NPIR; c k-means++; d BIRCH; e HDBSCAN; f EM; g MST; h HC-SL; i HC-CL; j GA; k PSO; l DCMVO with a different color for each cluster (color figure online)

To conclude the qualitative analysis of the best results, NPIR successfully predicts the clusters for almost all of the selected data sets. It outperforms the other algorithms on these data sets and is promising for other data sets. This indicates that the algorithm can handle non-spherical shapes, clusters of different densities and sizes, and data sets in which the points of different clusters interleave along the X or Y axis.

4.5 Mall customer segmentation case study

Customer behavior and purchasing data for large stores such as malls can be used to gain insights about the future behavior of these customers. Customer data such as age, gender, annual income, and spending score can be obtained through membership cards.

The goal of this case study is to segment customers according to their annual income and purchasing behavior. Strategic plans can then be built around the customers who can most easily be converted to spending more on purchased items.

We use the mall customer segmentation data set from the Kaggle repository (Footnote 11), which has 100 instances and 5 features. We segment customers according to two features, the annual income and the spending score, to get insights about the different types of customers. The distributions of the annual income and the spending score are shown in Figs. 19 and 20. We can observe from the two figures that both features are approximately normally distributed. This means that the mall customers have different annual incomes and that most of them have an income in the range of $50k to $80k. Moreover, the spending behavior of the customers is almost normally distributed, with most of them falling in the 40-60 score range.
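
The two distribution plots can be produced directly from the CSV file. The file name and column names in the sketch below ('Mall_Customers.csv', 'Annual Income (k$)', 'Spending Score (1-100)') are assumed from the public Kaggle version of the data set and may differ from the exact preprocessing used here.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Column names assumed from the public Kaggle version of the data set.
df = pd.read_csv('Mall_Customers.csv')
features = ['Annual Income (k$)', 'Spending Score (1-100)']

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, col in zip(axes, features):
    df[col].plot.hist(bins=15, ax=ax)   # roughly bell-shaped for both features
    ax.set_xlabel(col)
plt.show()
```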

Fig. 19

Distribution of annual income

Fig. 20

Distribution of spending score

We apply NPIR to the mall customer data set with values of 5, 0.1, and 100 for k, IR, and i, respectively. k is determined using the silhouette coefficient [80], which is a common approach for selecting the number of clusters before performing the clustering task [81]. The IR and i values are selected because they are the dominant values in the experiments discussed previously.
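
A minimal sketch of the silhouette-based choice of k is shown below. It uses k-means as a stand-in clusterer for scoring candidate values of k, since the silhouette coefficient only needs a labeling of the points, and it reuses the df and features names assumed in the previous sketch.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = df[features].to_numpy()   # annual income and spending score, as above

best_k, best_labels, best_score = None, None, -1.0
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_labels, best_score = k, labels, score
print(best_k, best_score)     # k = 5 is expected to score well on this data set
```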

The results of the segmentation of the mall customers using NPIR are shown in Table 19 and Fig. 21. It can be observed from the figure and the table that most customers are concentrated in the middle of the figure, which means that they have an intermediate annual income and spend moderately on purchases in the mall; we refer to them as standard customers. Customers with low income are referred to as sensible or careless customers according to their spending behavior: those with low spending are sensible, while those with a high rate of spending are considered careless given their low income. In contrast, customers with high income are referred to as careful or target customers according to their spending behavior: those with low spending are considered careful, since they are thrifty with their high income, while those with a high rate of spending are the target customers, which are the main goal of the study. These target customers can be driven to a higher spending rate through strategic planning by the mall owners.
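
The naming of the five segments can be expressed as a simple rule on the per-cluster mean income and spending. The thresholds below are illustrative assumptions, not values reported in this study, and the function name is hypothetical.

```python
import numpy as np

def name_segments(X, labels, income_split=(40, 70), spending_split=(40, 60)):
    """Map each cluster id to a descriptive name from its mean income and spending.
    X columns: [annual income, spending score]; thresholds are illustrative."""
    names = {}
    for c in np.unique(labels):
        income, spending = X[labels == c].mean(axis=0)
        if income < income_split[0]:
            names[c] = 'sensible' if spending < spending_split[0] else 'careless'
        elif income > income_split[1]:
            names[c] = 'careful' if spending < spending_split[1] else 'target'
        else:
            names[c] = 'standard'
    return names
```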

Table 19 Segmentation of customers
Fig. 21

Visualization of clusters

We further analyze the distribution of each customer segment in Figs. 22 and 23. Figure 22 shows the box plot of the customers' annual income, while Fig. 23 shows the box plot of their spending score. The interquartile range, high values, and low values are represented by the box, the upper whiskers, and the lower whiskers, respectively [82, 83]. The figures show that target customers have the highest income and spending, while sensible customers have the lowest income and spending; standard customers have intermediate income and spending. In addition, careless customers have low income but high spending, while careful customers have high income with low spending. It is also observed that the target and careful customers, both with high income, have some outliers with a very high income above $120,000. Moreover, the spending and income values of standard customers are close to their mean values, with a compact box, which indicates the stability of this cluster of customers.
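
Such per-segment box plots can be drawn with pandas once the cluster labels and segment names are attached to the data frame. The sketch below reuses the names from the earlier sketches (df, X, best_labels, name_segments), which are assumptions standing in for the NPIR cluster assignments used in the paper.

```python
import matplotlib.pyplot as plt

# Attach a segment name to each customer and draw per-segment box plots.
segment_names = name_segments(X, best_labels)               # from the sketches above
df['segment'] = [segment_names[c] for c in best_labels]
df.boxplot(column='Annual Income (k$)', by='segment')       # analogue of Fig. 22
df.boxplot(column='Spending Score (1-100)', by='segment')   # analogue of Fig. 23
plt.show()
```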

Fig. 22

Box plot of annual income per cluster

Fig. 23

Box plot of spending score per cluster

5 Conclusion and future work

This paper proposes a new clustering algorithm named Nearest Point with Indexing Ratio (NPIR). NPIR tries to overcome the limitations of some clustering algorithms by avoiding fixed shaped clustering and identifying arbitrary shapes and non-spherical distributions of points. The main idea of the algorithm is to find the nearest point to an already assigned point using the distance between them and then assign that nearest point to the cluster of the assigned point. It is based on the nearest neighbor search technique and on a random and iterative application of three operations, namely election, selection, and assignment, with the aim of generating high-quality clustering results.

Based on the conducted experiments, the following concluding remarks can be noted:

  • NPIR outperforms most of the compared algorithms, namely k-means++, BIRCH, HDBSCAN, EM, GA, MST, HC-SL, HC-CL, PSO, and DCMVO, in terms of Homogeneity Score, Completeness Score, V-measure, Adjusted Rand Index, and Adjusted Mutual Information on the 20 selected data sets.

  • NPIR addresses the limitations of other algorithms on data sets with arbitrary shapes, non-spherical distributions of points, and clusters of different sizes or densities.

  • NPIR tries to avoid terminating the cluster shaping process at an early stage by integrating a flexible mechanism that can move any point away from its cluster and assign it to another one, which can correct wrong assignments of points and wrong selections of the initial points.

  • The PSO, MST, and HC-SL clustering algorithms are the closest competitors to NPIR.

  • NPIR can be applied on real-life applications such as customer segmentation.

Much work can be done in the future to optimize the current implementation of the algorithm and to validate its efficiency for different applications, which can be summarized as follows:

  • The algorithm can be evaluated on other real-life applications, including bioinformatics, medical imaging, face recognition, image segmentation, geographic information systems, and more.

  • The algorithm can be evaluated on large scale data sets using the K-dimensional tree data structure.

  • The implementation of the algorithm can be enhanced to dynamically detect the correct number of clusters (k).

  • Optimization algorithms can be applied on NPIR to predict the best values of IR, i, and initial points.

  • Fuzziness can be added to the algorithm to support overlapping clustering.

  • The application of hierarchical clustering in NPIR can be investigated for the possibility of obtaining better clustering quality.

  • The efficiency of adopting distance measures other than the Euclidean distance can be investigated to overcome the limitation observed when clustering high-dimensional data sets.