Keywords

1 Introduction

Clustering algorithms discover naturally occurring structures in datasets. Nowadays, extensive collections of data pose a great challenge for clustering algorithms. So, many researchers create different new clustering algorithms or modify existing approaches [5, 6, 11, 19, 21]. It is worth noting that data clustering is applied in various areas, e.g. biology, spatial data analysis, or business. The key issue is the right choice of input parameters because the same algorithm can produce different results depending on applied parameters. This problem can be resolved by using different cluster validity indices, e.g., [10, 26, 29, 30]. Generally, clustering algorithms can be divided into four categories: partitioning, hierarchical, grid-based, and density-based clustering. Well-known partitioning algorithms include the K-means or Partitioning Around Medoids (PAM) [3, 32]. The next clustering category called hierarchical is based on an agglomerative or divisive approach, e.g. the Single-linkage, Complete-linkage, Average-linkage, or Divisive ANAlysis Clustering (DIANA) [17, 23]. On the other hand, the grid-based approach creates a grid of cells for a dataset, e.g. the Statistical Information Grid-based (STING) or Wavelet-based Clustering (WaveCluster) methods [18, 28, 31]. The last category can be represented by the Density-Based Spatial Clustering of Application with Noise (DBSCAN) algorithm [9] which has many modifications [7, 12, 13, 15, 27]. This algorithm can discover clusters of an arbitrary shape and size but requires two input parameters, i.e. the eps and the MinPts. The determination of these parameters is very important for the DBSCAN algorithm to work properly. It is important to note that clustering methods can be used during the process of designing various neural networks [1, 2], fuzzy, and rule systems [4, 8, 14, 16, 20, 22, 24, 25].

In this paper, a new approach to determining the eps and MinPts parameters is proposed. It is based on the creation of a proper grid of cells and the grid is used to define the values of the two parameters. This paper is organized as follows: Sect. 2 presents a description of the DBSCAN clustering algorithm. In Sect. 3 the new method for determining the parameters is outlined, while Sect. 4 illustrates the experimental results on datasets. Finally, Sect. 5 presents the conclusions.

2 The DBSCAN Algorithm

The concept of the DBSCAN algorithm is presented in this section. As mentioned above, this algorithm is very popular, because it can find clusters of arbitrary shapes and requires only two input parameters, i.e. the eps and the MinPts. The eps is usually determined by the user and it has a large influence on the creation of clusters. The next parameter, i.e. the MinPts is the minimal number of neighboring points belonging to the so-called core point. Let us denote a dataset by X, where point \(p\in X\). The following definitions (see [9]) will be helpful in understanding how the DBSCAN algorithm works.

Definition 1:

The eps-neighborhood of point \(p \in X\) is called \(N_{eps}(p)\) and is defined as follows: \(N_{eps} \left( p \right) = \left\{ {q \in X|dist(p,q) \le eps} \right\} \), where dist(pq) is a distance function between p and q.

Definition 2:

p is called the core if the number of points belonging to \(N_{eps}(p)\) is greater or equal to the MinPts.

Definition 3:

Point q is directly density-reachable from point p (for the given eps and the MinPts) if p is the core point and q belongs to \(N_{eps}(p)\).

Definition 4:

if point q is directly density-reachable from point p and the number of points belonging to \(N_{eps}(q)\) is smaller than the MinPts, q is called a border point.

Definition 5:

Point q is a noise if it is neither a core point nor a border point.

Definition 6:

Point q is density-reachable from point p (for the given eps and the MinPts) if there is a chain of points \(q_1, q_2,...,q_n\) and \(q_1=p\), \(q_n=q\), so that \(q_{i+1}\) is directly density-reachable from \(q_i\)

Definition 7:

Point q is density-connected to point p (for the given eps and the MinPts) if there is point o such that q and p are density-reachable from point o.

Definition 8:

Cluster C (for the given eps and the MinPts) is a non-empty subset of X and the following conditions are satisfied: first, \(\forall p,q\): if \(p \in C\) and q is density-reachable from p, then \(q \in C\), next \(\forall p,q \in C\): p is density-connected to q.

The DBSCAN algorithm creates clusters according to the following: at first, point p is selected randomly if \(|N_{eps}(p)|\ge MinPts\), than point p will be the core point and a new cluster will be created. Next, the new cluster is expanded by the points which are density-reachable from p. This process is repeated until no cluster is found. On the other hand, if \(|N_{eps}(p)|\) \(< MinPts\), then point p will be a noise, but this point can be included in another cluster if it is density-reachable from some core point.

Fig. 1.
figure 1

An example of a 2-dimensional dataset consisting of four clusters.

3 Grid-Based Approach to Determining the Eps and MinPts Parameters

The right choice of the eps and \( MinPts\) parameters is a fundamental issue for the high performance of the DBSCAN algorithm. The proposed method is based on a uniform grid of cells which is created for a dataset. In order to provide a clearer explanation of this new approach, an example of a 2-dimensional dataset is generated. Figure 1 shows this dataset consisting of 1200 elements located in four clusters, i.e. 200, 250, 300 and 450 elements per cluster, respectively. Next, for this dataset an example grid of cells can be created, e.g. consisting of 100 cells (10 x 10). Figure 2 shows this uniform grid of cells. It can be noted that the proper grid can be used to define the value of the eps parameter, but the key issue is an appropriate choice of the size of the grid, which has a big influence on the value of the parameter. In this new method, a way of solving this problem is proposed and it consists of a few steps. First, several grids of cells are created, where the size of rows and columns of grids change in a wide range, i.e. from 2 to 90 (2 x 2 and 90 x 90 cells). So, the number of cells is changed from 4 to 8100. Such a number of cells gives precise information about the properties of a dataset. Let us denote the size of a grid by \(G_{size}\). For all the created grids, three ranges can be defined as in the following:

Fig. 2.
figure 2

Uniform grid consisting of 100 cells (10 x 10) for the example dataset.

$$\begin{aligned} \begin{aligned} \begin{array}{*{20}c} {range1} \quad {for} \quad {2 \le \;\;G_{size} \; \le \;30} \\ {range2} \quad {for} \quad {30< \;G_{size} \; \le \;60} \\ {range3} \quad {for} \quad {60 < \;G_{size} \; \le \;90} \\ \end{array} \end{aligned} \end{aligned}$$
(1)

It is worth noting that the second parameter of the DBSCAN algorithm, i.e. the \( MinPts\) is also very important and it affects a number of so-called noise data. Generally, the choice of this parameter is often realized individually depending on a dataset, but very often the MinPts equals 4, 5, or 6. Such values of this parameter ensure a good compromise between the size of clusters and an amount of noise data in most cases. So, in this new approach, the values of the MinPts are selected from 4 to 6. As mentioned above, the sizes of the grids range from 2 to 90. Next, in all the created grids are found cells which include only 4 elements. Then, the grid which includes a maximum number of cells with four elements is found and the size of the grid is noted by \(G_{max4}\). Furthermore, the grids which include a maximum number of cells with 5 and 6 elements are also found and the sizes of grids are noted by \(G_{max5}\) and \(G_{max6}\). In the next step, the \(dist_4\), \(dist_5\) and \(dist_6\) parameters are determined for the \(G_{max4}\), \(G_{max5}\) and \(G_{max6}\) grid sizes, respectively. The values of these parameters are maximum distances between the elements of the cells which include 4, 5, and 6 elements, respectively. Next, if condition \((G_{max4}> G_{max5} > G_{max6})\) is fulfilled, the value of the eps is defined as follows:

$$\begin{aligned} \begin{aligned} eps = \left\{ {\begin{array}{*{20}c} {a*dist_4 \quad for\quad G_{max4} \in range1 } \\ {b*dist_5 \quad for\quad G_{max4} \in range2 } \\ {c*dist_6 \quad for\quad G_{max4} \in range3 } \\ \end{array}} \right. \end{aligned} \end{aligned}$$
(2)

where factors a, b and c are experimentally determined and their values are 1, 1.2 and 1.5, respectively. On the other hand, the MinPts is expressed as follows:

$$\begin{aligned} \begin{aligned} MinPts = \left\{ {\begin{array}{*{20}c} {4 \quad for\quad G_{max4} \in range1 } \\ {5 \quad for\quad G_{max4} \in range2 } \\ {6 \quad for\quad G_{max4} \in range3 } \\ \end{array}} \right. \end{aligned} \end{aligned}$$
(3)

Sometimes, for different datasets condition \((G_{max4}> G_{max5} > G_{max6})\) may not be fulfilled. This means that the clusters have a different density because when the MinPts increases and the clusters have a similar density, the maximum number of cells should be decreased. In these cases, when the condition is not fulfilled the values of the a, b, and c factors should be increased so that they equal 2. In Table 1 are presented the values of \(G_{max4}\), \(G_{max5}\) and \(G_{max6}\) calculated for the example dataset. It can be observed that for the MinPts equal to 5, the size of the grid is larger than the size for the MinPts equal to 4. So, clusters are of different density in the dataset (see Fig. 1). Condition \((G_{max4}> G_{max5} > G_{max6})\) is not fulfilled and the b parameter is increased (equals 2). Moreover, when the MinPts is equal to 4, \(G_{max4}\) is 52 and is included in range2. Thus, \(eps=b*dist_5\) (see Eq. 2) and the values of the eps and MinPts parameters are 0.20 and 5, respectively. Such values of input parameters are used in the DBSCAN algorithm.

Table 1. Values of \(G_{max4}, G_{max5} \ and \ G_{max6} \) for the example dataset
Fig. 3.
figure 3

Results of the DBSCAN clustering algorithm for the example dataset.

Figure 3 shows the results of the DBSCAN clustering algorithm for the example dataset. In the next section, the results of the experimental tests are presented to confirm the effectiveness of the new approach.

4 Experimental Results

In this section, several experiments have been conducted on 2-dimensional artificial datasets. In these experiments, the DBSCAN algorithm is used to cluster the data. As mentioned above, the eps and MinPts parameters play a very important role in creating correct clusters by this clustering algorithm. So, they are defined based on the new method described in Sect. 3 and the calculated values of these parameters are presented in Table 3. Moreover, the evaluation of the accuracy of the DBSCAN algorithm is conducted by a visual inspection. It is worth noting that the artificial datasets include clusters of various shapes and sizes. On the other hand, for clustering multidimensional datasets, determining the input parameters of the DBSCAN algorithm is very difficult.

Table 2. A detailed description of the artificial datasets
Table 3. The eps and MinPts values used by the DBSCAN algorithm

4.1 Datasets

In the conducted experiments nine 2-dimensional datasets are used. Most of them come from the R package. The artificial data are called Data 1, Data 2, Data 3, Data 4, Data 5, Data 6, Data 7, Data 8 and Data 9, respectively. They consist of a various number of clusters, i.e. 2, 3, 4, 5, and 6 clusters. The scatter plot of these data is presented in Fig. 4. As it can be observed on the plot, the clusters are located in different areas and some of the clusters are very close to each other and the others are quite far apart. For instance, Data 1 is a so-called spirals problem, where the points are on two entangled spirals, in Data 5 the elements create a Gaussian, square, triangle and wave shapes and Data 6 consists of 2 Gaussian eyes, a trapezoid nose and a parabola mouth (with a vertical Gaussian one). Moreover, the sizes of the clusters are different and they contain a various number of elements. In Table 2 is shown a description of these datasets.

Fig. 4.
figure 4

Examples of 2-dimensional artificial datasets: (a) \({ Data\; \mathrm 1}\), (b) \({ Data\; \mathrm 2}\), (c) \({ Data\; \mathrm 3}\), (d) \({ Data\; \mathrm 4}\), (e) \({ Data\; \mathrm 5}\), (f) \({ Data\; \mathrm 6}\), (g) \({ Data\; \mathrm 7}\), (h) \({ Data\; \mathrm 8}\) and (i) \({ Data\; \mathrm 9}\).

4.2 Experiments

The experimental analysis is designed to evaluate the performance of the new method to specify the eps and MinPts parameters. As mentioned above, these parameters are very important for the DBSCAN algorithm to work correctly. In standard approaches, they are determined by a visual inspection of the sorted values of a function which computes a distance between each element of a dataset and its k-th nearest neighbor. The new approach described in Sect. 3 is based on finding a proper grid of cells and it makes it possible to determine these two input parameters. In these experiments, the nine 2-dimensional datasets used are called \({ Data\; \mathrm 1}\), \({ Data\; \mathrm 2}\), \({ Data\; \mathrm 3}\), \({ Data\; \mathrm 4}\), \({ Data\; \mathrm 5}\), \({ Data\; \mathrm 6}\), \({ Data\; \mathrm 7}\), \({ Data\; \mathrm 8 }\) and \({ Data\; \mathrm 9}\) datasets. It is worth noting that the value of the MinPts parameter is also chosen when the size of the grid changes from 2 to 90 (2 x 2 and 90 x 90 cells). Then, when these parameters are specified by the new method, the DBSCAN algorithm can be used to cluster these datasets. Figure 5 shows the results of the DBSCAN algorithm, where each cluster is marked with different signs. The data elements classified as the noise are marked with a circle. Thus, despite the fact that the differences in the distances and the shapes between clusters are significant, all the datasets are clustered correctly by the DBSCAN. Moreover, a number of the data elements classified as noise in all the datasets is relatively insignificant.

Fig. 5.
figure 5

Results of the DBSCAN clustering algorithm for 2-dimensional datasets: (a) \({ Data\; \mathrm 1}\), (b) \({ Data\; \mathrm 2}\), (c) \({ Data\; \mathrm 3}\), (d) \({ Data\; \mathrm 4}\), (e) \({ Data\; \mathrm 5}\), (f) \({ Data\; \mathrm 6}\), (g) \({ Data\; \mathrm 7}\), (h) \({ Data\; \mathrm 8}\) and (i) \({ Data\; \mathrm 9}\)

5 Conclusions

In this paper, a new approach is proposed to calculate the eps and MinPts parameters of the DBSCAN algorithm. It is based on finding the right grid of cells, which is selected from many other grids. As mentioned above, the sizes of the grids change from 2 to 90. It is worth noting that the determination of the MinPts parameter is also difficult and it is often chosen empirically depending on datasets being investigated. In this new method, the values of the MinPts parameter are selected from 4 to 6. Generally, the right grid of cells makes it possible to correctly calculate these two input parameters. In the conducted experiments, several 2-dimensional datasets were used, where a number of clusters, sizes, and shapes were very different. All the presented results confirm the high efficiency of the newly proposed approach.