1 Introduction

Clustering refers to grouping objects into meaningful clusters so that the elements of a cluster are similar, whereas elements of different clusters are dissimilar. Data clustering is a very useful technique applied in many fields, such as biology, spatial data analysis, and business. Moreover, clustering methods can be used in the process of designing various neural networks [1, 2], fuzzy and rule-based systems [7, 9, 20, 29], and creating algorithms for the identification of classes [12]. The variety of large data collections poses a great challenge for clustering algorithms, so many new clustering algorithms and their configurations are being intensively developed, e.g. [11, 13, 14]. It should be noted that there is no clustering algorithm which creates the right clusters for all datasets. Moreover, the same algorithm can produce different results depending on the input parameters applied. Therefore, cluster validation should also be used to assess the results of data clustering, and a number of authors have proposed different cluster validity indices or modifications of existing ones, e.g. [10, 23, 26, 27]. Generally, clustering algorithms can be divided into four categories: partitioning, hierarchical, grid-based, and density-based clustering. Well-known partitioning algorithms include K-means, Partitioning Around Medoids (PAM) [4, 32], and Expectation Maximization (EM) [18], whereas hierarchical clustering includes agglomerative and divisive approaches, e.g. Single-linkage, Complete-linkage, Average-linkage, or DIvisive ANAlysis Clustering (DIANA) [19, 22]. The grid-based approach includes methods such as the Statistical Information Grid-based method (STING) or Wavelet-based Clustering (WaveCluster) [21, 25, 30]. The last category is the density-based approach, of which the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is the most well-known representative [8]. DBSCAN is seldom used to cluster multidimensional data, but the original algorithm now has many extensions, e.g. [3, 5, 6, 17, 24, 31]. It requires two input parameters, i.e. the eps and the MinPts. Determining these parameters is very difficult, yet their right choice is a fundamental issue. In the literature, some methods have been proposed to determine these parameters, e.g. [16].

In this paper, a new approach to determining the eps parameter is proposed. It is based on an analysis of abrupt changes in the distances between each element of the dataset and its k-th nearest neighbor. This paper is organized as follows: Sect. 2 presents a detailed description of the DBSCAN clustering algorithm. In Sect. 3 the new method to determine the eps radius is outlined, while Sect. 4 illustrates experimental results obtained on several artificial datasets. Finally, Sect. 5 presents conclusions.

2 The Description of the DBSCAN Algorithm

In this section, the basic concept of the DBSCAN algorithm is described. As mentioned above, it is a very popular algorithm because it can find clusters of arbitrary shapes and requires only two input parameters, i.e. the eps (the radius of the neighborhood) and the MinPts (the minimum number of points required to form a dense region). To understand the basic concept of the algorithm, several terms should be explained. Let us denote a dataset by X, where point \(p\in X\). The eps is usually determined by the user, and the right choice of this parameter is a key issue for this algorithm. The MinPts is the minimum number of points that must belong to the eps-neighborhood of a point for it to be a so-called core point.

Definition 1

The eps-neighborhood of point \(p \in X\) is called \(N_{eps}(p)\) and is defined as follows: \(N_{eps} \left( p \right) = \left\{ {q \in X \mid dist(p,q) \le eps} \right\} \), where dist(p, q) is a distance function between p and q.

When the number of points belonging to the eps-neighborhood of p is greater than or equal to the MinPts, p is called a core point.
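For illustration, the eps-neighborhood and the core-point condition can be written in a few lines of code. The following Python sketch assumes the Euclidean distance; the names eps_neighborhood and is_core are introduced here for convenience and are not part of the original algorithm description.

```python
import numpy as np

def eps_neighborhood(X, p, eps):
    """N_eps(p) from Definition 1: indices of all points of X within
    distance eps of point X[p] (the point itself included)."""
    return np.flatnonzero(np.linalg.norm(X - X[p], axis=1) <= eps)

def is_core(X, p, eps, min_pts):
    """Core-point test: the eps-neighborhood contains at least MinPts points."""
    return len(eps_neighborhood(X, p, eps)) >= min_pts
```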

Definition 2

Point p is directly density-reachable from point q with respect to the eps and the MinPts when q is a core point and p belongs to the eps-neighborhood of q.

When point p is directly density-reachable from point q and the number of points belonging to the eps-neighborhood of p is smaller than the MinPts, p is called a border point.

Definition 3

Point p is a noise point if it is neither a core point nor a border point.

Definition 4

Point p is density-reachable from point q with respect to the eps and the MinPts when there is a chain of points \(p_1, p_2, \ldots, p_n\), \(p_1=q\), \(p_n=p\), such that \(p_{i+1}\) is directly density-reachable from \(p_i\) for \(i = 1, \ldots, n-1\).

Definition 5

Point p is density-connected to point q with respect to the eps and the MinPts when there is a point o such that both p and q are density-reachable from o.

Definition 6

Cluster C with respect to the eps and the MinPts is a non-empty subset of X, where the following conditions are satisfied:

  1. \(\forall p,q\): if \(p \in C\) and q is density-reachable from p with respect to the eps and the MinPts, then \(q \in C\).

  2. \(\forall p,q \in C\): p is density-connected to q with respect to the eps and the MinPts.

The DBSCAN algorithm creates clusters according to Definition 6. At first, a point p is selected randomly and if \(|N_{eps}(p)|\ge MinPts\), then p is a core point and is marked as a new cluster. Next, the new cluster is expanded by the points which are density-reachable from p. This process is repeated until no more clusters are found. On the other hand, if \(|N_{eps}(p)| < MinPts\), then point p is considered to be noise. However, such a point can later be included in a cluster if it is density-reachable from some core point.
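The cluster-expansion procedure described above can be summarized in code. The following is a minimal illustrative sketch, not the original implementation; it assumes the Euclidean distance and a dataset small enough to hold the full pairwise distance matrix in memory, and the function name dbscan is introduced here for convenience.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch following Definition 6.
    Returns a label for each point: 0, 1, ... are cluster ids, -1 is noise."""
    n_points = len(X)
    labels = np.full(n_points, -1)            # every point starts as noise
    visited = np.zeros(n_points, dtype=bool)
    # full pairwise Euclidean distance matrix; fine for small datasets only
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    cluster_id = 0
    for p in range(n_points):
        if visited[p]:
            continue
        visited[p] = True
        neighbors = list(np.flatnonzero(dist[p] <= eps))
        if len(neighbors) < min_pts:
            continue                          # noise for now; may become a border point
        labels[p] = cluster_id                # p is a core point: start a new cluster
        seeds = neighbors                     # expand by density-reachable points
        while seeds:
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = cluster_id        # claim unlabeled or noise points
            if visited[q]:
                continue
            visited[q] = True
            q_neighbors = np.flatnonzero(dist[q] <= eps)
            if len(q_neighbors) >= min_pts:   # q is a core point: expand through it
                seeds.extend(q_neighbors)
        cluster_id += 1
    return labels
```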

Fig. 1. An example of a 2-dimensional dataset consisting of four clusters.

3 The New Approach to Determining the Radius of the Neighborhood

The right choice of the eps (the radius of the neighborhood) is a key issue for the right performance of the DBSCAN algorithm. It is a very difficult task, and a distance function is usually used to solve this problem. This function, denoted \(k_{dist}\), calculates the distance between each element of the X dataset and its k-th nearest neighbor, where k is the number of nearest neighbors considered. For instance, Fig. 1 shows an example of a 2-dimensional dataset consisting of four clusters, which contain 138, 119, 127, and 116 elements, respectively. First, the \(k_{dist}\) function is used to determine the distances between each element of the X dataset and its k-th nearest neighbor. Next, the results are sorted in ascending order. Figure 2 presents the sorted results for \(k=4\). It can be observed that there is a sharp change along the distance curve, i.e. the values of the distances increase significantly. This place is called the “knee” and it can be used to determine the right value of the eps parameter of the DBSCAN algorithm. However, even when the “knee” is found correctly, the determination of the eps parameter is still difficult.
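The sorted values of the \(k_{dist}\) function are straightforward to compute. Below is a short sketch under the same assumptions as before (Euclidean distance, full pairwise distance matrix); the name sorted_k_dist is introduced here for illustration.

```python
import numpy as np

def sorted_k_dist(X, k=4):
    """Distance from each point of X to its k-th nearest neighbor,
    sorted in ascending order (the curve plotted in Fig. 2)."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    dist.sort(axis=1)           # row-wise; column 0 is the point itself (distance 0)
    return np.sort(dist[:, k])  # k-th nearest neighbor of every point, then sort
```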

Fig. 2. Sorted values of the \(k_{dist}\) function with respect to \(k=4\) for the example dataset.

Fig. 3. Sorted values of the \(k_{dist}\) function with respect to \(k=4\) and ten identical intervals between the \(s_{start}\) and \(s_{stop}\) points.

Fig. 4. Ten identical intervals between the \(s_{start}\) and \(s_{stop}\) points: a1, ..., a10.

A new approach is proposed to solve this problem. This method is a modification of the approach presented in [28] and consists of a few steps. Let us denote the set of all sorted values of the \(k_{dist}\) function for the X dataset by \(S_{dist}\). As mentioned above, the eps parameter depends on the “knee” occurring in the distances sorted in ascending order for a given k. It should be noted that in Fig. 2 the values of the \(k_{dist}\) function increase very abruptly to the right of the “knee”. This means that there are elements of the dataset located outside clusters, and they can be interpreted as noise. Moreover, the “knee” usually appears at the end of the sorted values of the \(k_{dist}\) function, and its size depends on the properties of the dataset. Among the sorted values, it is possible to define a range which indicates the knee more precisely. It is determined by the \(s_{start}\) and \(s_{stop}\) values, which are calculated as follows:

$$\begin{aligned} s_{start}&= \left| S_{dist} \right| - \left| X \right| \\ s_{stop}&= \left| S_{dist} \right| \end{aligned}$$
(1)

where \(\left| S_{dist} \right| \) is the number of elements of \(S_{dist}\) and \(\left| X \right| \) is the number of elements of the X dataset. Furthermore, the range from \(s_{start}\) to \(s_{stop}\) is divided into ten equal parts, i.e. a1, a2, ..., a10. The size of each part is \(n=\left| X \right| /10\). For example, Fig. 3 shows the sorted values of the \(k_{dist}\) function with respect to \(k=4\) and the ten identical intervals between the \(s_{start}\) and \(s_{stop}\) points, and Fig. 4 presents these intervals more precisely. Next, for each of the first seven intervals, the arithmetic mean is calculated as follows:

$$\begin{aligned} v_i = \frac{k_{dist} \left( x_i \right) + k_{dist} \left( x_i + n \right) }{2} \end{aligned}$$
(2)

where \(i=1, \ldots, 7\) and n is a constant equal to the number of \(k_{dist}\) values occurring in each of the parts a1, ..., a10. Here \(x_i\) denotes the argument of the \(k_{dist}\) function (see Fig. 4), so \(x_1\) equals the x component of the \(s_{start}\) point located at the start of the \(a_1\) interval. Furthermore, \(x_2\) is equal to \(x_1+n\), \(x_3\) is equal to \(x_2+n\), and so on. However, for intervals a8, a9 and a10, the arithmetic means are calculated as follows:

$$\begin{aligned} v_j = \frac{k_{dist} \left( x_j \right) + k_{dist} \left( x_{10} \right) }{2} \end{aligned}$$
(3)

where \(j=8, 9, 10\) and \(x_{10}\) equals the x component of the \(s_{stop}\) point located at the end of the \(a_{10}\) interval. In Fig. 5 the average values of all the intervals are presented. These average values (see Eqs. (2) and (3)) are used to calculate the eps parameter, since the “knee” can be analyzed by means of them. First, two factors, \(v_{7:1}\) and \(v_{8:7}\), are computed as follows:

$$\begin{aligned} v_{7:1} = \frac{v_7 - v_1}{2}, \quad v_{8:7} = \frac{v_8 - v_7}{2} \end{aligned}$$
(4)

These factors play a key role in the analysis of the “knee”. For instance, when the values of \(k_{dist}\) increase very slowly in a1, ..., a7, the factor \(v_{7:1}\) is small, which means that the “knee” can be quite wide. Furthermore, if the values of the \(k_{dist}\) function increase abruptly in a8, ..., a10, the factor \(v_{8:7}\) takes a large value. Thus, these factors can be used to calculate the eps parameter, which can be expressed as follows:

$$\begin{aligned} eps = v_{8:7} - v_{7:1} \end{aligned}$$
(5)

It should be noted that \(v_{8:7}\) is close to the right value of the eps parameter. However, \(v_{7:1}\) is also very important because it shows how the distances increase within the “knee”, and it is used to correct the value of \(v_{8:7}\) (see Eq. (5)).
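Putting Eqs. (1)-(5) together, the estimation of the eps can be sketched as follows. This is one plausible reading of the interval endpoints, in which \(x_{10}\) is taken as the index of the \(s_{stop}\) point, as stated above; since \(S_{dist}\) contains one distance per element of X, \(s_{start}=0\) here. The function estimate_eps is introduced for illustration and is not the authors' implementation.

```python
import numpy as np

def estimate_eps(s_dist):
    """Sketch of the proposed eps estimation from the sorted k-dist values."""
    m = len(s_dist)
    n = m // 10                        # size of each interval a1, ..., a10 (Eq. 1)
    x = [i * n for i in range(10)]     # left endpoints x_1, ..., x_9 of the intervals
    x[9] = m - 1                       # x_10 is placed at the s_stop point
    v = np.empty(10)
    for i in range(7):                 # Eq. (2): means for a1, ..., a7
        v[i] = (s_dist[x[i]] + s_dist[x[i] + n]) / 2
    for j in range(7, 10):             # Eq. (3): means for a8, a9, a10
        v[j] = (s_dist[x[j]] + s_dist[m - 1]) / 2
    v_71 = (v[6] - v[0]) / 2           # Eq. (4)
    v_87 = (v[7] - v[6]) / 2
    return v_87 - v_71                 # Eq. (5)
```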

Fig. 5. Average values calculated for all the intervals (i.e. a1, ..., a10).

In the next section, the results of an experimental study are presented to confirm the effectiveness of this new approach.

4 Experimental Study

In this section, several experiments conducted on 2-dimensional artificial datasets using the original DBSCAN algorithm are described. It should be noted that this algorithm can recognize clusters of arbitrary shapes. The artificial datasets used in these experiments include clusters of various sizes and shapes. Moreover, parameter k (i.e. the MinPts) is equal to 4 in all the experiments, and visual inspection is used to evaluate the accuracy of the DBSCAN algorithm. The described new method is used to automatically determine the eps parameter.

Fig. 6. Examples of 2-dimensional artificial datasets: (a) Data 1, (b) Data 2, (c) Data 3, (d) Data 4, (e) Data 5, (f) Data 6, (g) Data 7, (h) Data 8 and (i) Data 9.

Table 1. A detailed description of the artificial datasets

It can be noted that the DBSCAN algorithm is rarely used to cluster multidimensional data due to the so-called “curse of dimensionality”. However, different modifications of this algorithm have been proposed to solve this problem (e.g. [5]).

4.1 Datasets

Nine 2-dimensional datasets are used in the experiments and they are called Data 1, Data 2, Data 3, Data 4, Data 5, Data 6, Data 7, Data 8 and Data 9, respectively. Table 1 shows a detailed description of these datasets. It should be noted that they contain varied numbers of elements and clusters (from 2 to 8 clusters). Moreover, the shapes and sizes of the clusters are also different. These datasets are presented in Fig. 6. It can be observed that the distances between the clusters vary considerably: some of the clusters are very close to each other, while others are quite far apart. For instance, in Data 3 the elements create five clusters of different sizes, Data 7 contains elements which create Gaussian, square, triangle, and wave shapes, and Data 8 is the so-called spirals problem, where the points lie on two entangled spirals.
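As a rough illustration of this kind of data, a two-spirals set in the spirit of Data 8 can be generated as follows; this is only a toy analogue, and neither the function two_spirals nor its parameters come from the paper.

```python
import numpy as np

def two_spirals(n=300, noise=0.1, seed=0):
    """Toy analogue of the spirals problem: two entangled spiral arms."""
    rng = np.random.default_rng(seed)
    t = np.sqrt(rng.uniform(0.25, 1.0, n)) * 3 * np.pi  # angles along one arm
    arm = np.column_stack((t * np.cos(t), t * np.sin(t)))
    X = np.vstack((arm, -arm))                           # second arm: point reflection
    return X + rng.normal(scale=noise, size=(2 * n, 2))
```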

4.2 Experiments

In this section, the evaluation of the performance of the new method to automatically specify the eps parameter is presented. As mentioned above, the eps parameter is very important because the DBSCAN algorithm relies on it to create the right clusters. It is usually determined by visual inspection of the sorted values of the \(k_{dist}\) function. On the other hand, the new method described in Sect. 3 allows us to determine this parameter in an automatic way. The second parameter of the DBSCAN, i.e. the MinPts, is also important, but it is often chosen experimentally. In all the conducted experiments the MinPts equals 4. Such a choice of the parameter guarantees the creation of various clusters with different numbers of elements. In these experiments the 2-dimensional artificial datasets Data 1, Data 2, Data 3, Data 4, Data 5, Data 6, Data 7, Data 8 and Data 9 are used. Thus, once the eps parameter has been specified by the new method, the DBSCAN is used to cluster the artificial datasets. Figure 7 shows the results of the DBSCAN algorithm, where each cluster is marked with a different symbol. It should be noted that despite the significant differences in the distances between clusters and in their shapes, all the datasets are clustered correctly by the algorithm. Moreover, the data elements classified as noise are marked with a circle, and their number is small in all the datasets.
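For illustration, the sketches from the previous sections can be combined into a single pipeline: the eps is estimated from the sorted 4-dist curve and the DBSCAN is then run with the MinPts equal to 4. The four Gaussian blobs with added background noise below are only a stand-in for the paper's datasets, which are not reproduced here.

```python
import numpy as np

# Toy stand-in for the artificial datasets: four Gaussian blobs plus sparse noise.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0], [6.0, 6.0]])
blobs = np.vstack([c + rng.normal(scale=0.4, size=(120, 2)) for c in centers])
noise = rng.uniform(-2.0, 8.0, size=(40, 2))
X = np.vstack([blobs, noise])

s_dist = sorted_k_dist(X, k=4)          # sorted 4-dist curve (Sect. 3)
eps = estimate_eps(s_dist)              # automatic eps from the knee analysis
labels = dbscan(X, eps=eps, min_pts=4)  # MinPts = 4, as in the experiments
print(f"eps = {eps:.3f}, clusters found = {labels.max() + 1}, "
      f"noise points = {int(np.sum(labels == -1))}")
```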

Fig. 7. Results of the DBSCAN clustering algorithm for 2-dimensional datasets: (a) Data 1, (b) Data 2, (c) Data 3, (d) Data 4, (e) Data 5, (f) Data 6, (g) Data 7, (h) Data 8 and (i) Data 9.

5 Conclusions

In this paper, a new method is proposed to determine the eps parameter of the DBSCAN algorithm. This method is based on the \(k_{dist}\) function, which computes the distance between each element of a dataset and its k-th nearest neighbor. The calculated distances are sorted in ascending order to find the knee. Next, the distances creating the knee are divided into several intervals, which are used to calculate the mean values (see Eqs. (2) and (3)). This makes it possible to calculate the right value of the eps parameter. In the conducted experiments, several 2-dimensional datasets were used, where the number, sizes, and shapes of the clusters varied within a wide range. From the perspective of the conducted experiments, this method for computing the eps is very useful. All the presented results confirm the high efficiency of the newly proposed approach.