
1 Introduction

Clustering is one of the most widely used and effective data analysis techniques for unsupervised learning. It attempts to organize the input data set, based on similarity, into a finite set of semantically coherent groups. Clustering algorithms can be loosely categorized into seven groups: hierarchical, density-based, partitioning, graph-based, combinational, model-based and grid-based algorithms. Density-based algorithms are noted for their concise description and hence the comparatively straightforward implementation of this kind of algorithm. Another vital feature is that, even in data sets containing outliers, they are able to discover clusters of varying shapes and sizes and do not require users to specify the number of clusters. DBSCAN defines clusters as dense regions and classifies low-density regions as noise or outliers, providing high-quality output that depends on two specified parameters, Eps and MinPts. The basic principle behind this cluster-finding algorithm is that, for each point in a cluster, the neighborhood of the specified radius Eps must contain at least a minimum number of points (MinPts). DBSCAN picks an object and evaluates it only once. The neighborhood of the object is analyzed in such a manner that, if it satisfies the minimum requirement for forming a cluster, a cluster is created to which objects can be added later; if the neighborhood does not fulfil this threshold criterion, the object is declared noise. Many clustering tools are unable to identify clusters of arbitrary shapes, so DBSCAN has the added advantage of being able to recognize clusters of arbitrary shape. The drawbacks of DBSCAN include its worst-case complexity, which tends to O(n²). Additionally, spatial indexing methods do not work efficiently for higher-dimensional data, so the runtime complexity grows from O(n log n) back to O(n²) in high dimensions. To overcome these issues, the SuperDBSCAN algorithm, a combination of the SuperCube-based Accelerated DBSCAN algorithm with PCA (Principal Component Analysis), is proposed. This algorithm runs with a temporal complexity of O(n log n), even for higher-dimensional data. It uses a unique combination of a virtual grid, which is imposed on the input data, and representative points; this significantly reduces the number of comparisons that need to be made, which translates into a run-time speed-up of up to 52.57% when compared with other proposed improvements. The PCA step derives a low-dimensional set of features from a much larger set while preserving as much variance as possible, and also helps in visualizing higher-dimensional data. Another criticism of the DBSCAN algorithm has been its sensitivity to its input parameters Eps and MinPts. The SuperDBSCAN algorithm eliminates the need for the MinPts parameter, thus making it more accessible to non-expert users. The major contributions of this article are summarized as follows:

  1. To reduce high-dimensional data to low-dimensional data.

  2. To identify and classify clusters of similar density.

  3. To maintain 100% of the accuracy of the original DBSCAN algorithm.

  4. To achieve a significant run-time speed-up compared with other improvements proposed for the DBSCAN algorithm.

  5. To eliminate the need for the MinPts input parameter, making the algorithm more user-friendly for non-expert users.

2 Literature Survey

DBSCAN was originally introduced by Ester and Kriegel and has since been used as an underlying algorithm in many applications. A large amount of work has been done on DBSCAN, such as parallel computing, dimensionality reduction and cloud computing, to enhance the accuracy and efficiency of the DBSCAN clustering method. Research has also been carried out to develop algorithms that improve the density calculations on the data set and thereby decrease accuracy loss.

  • [1] This algorithm makes use of a ranked-retrieval technique called WAND to improve the results of DBSCAN clustering, and further works by reducing the number of times WAND is invoked. [2] Another approach, P-DBSCAN, modifies DBSCAN for the detection of geo-tagged photos using the concept of adaptive density for fast convergence towards high-density regions; the density threshold is also used as one of the factors for the efficient working of DBSCAN. [3] Another methodology develops a linear DBSCAN algorithm based on locality-sensitive hashing (LSH). The main limitations of this paper are the additional input parameters and the lack of generality. Unlike the original DBSCAN, the technique uses LSH, which allows a faster region query for the k neighbors of a data point. The paper also includes a seed-point selection strategy, based on influence space and neighborhood similarity, that selects only some seed points rather than the entire neighborhood during cluster expansion. As a result, the number of region queries can be reduced while still ensuring good clustering accuracy. [4] A parallel DBSCAN clustering algorithm was also implemented in previous work to handle large-scale data processing using the big-data processing platform Spark. Spark, as a new generation of fast general-purpose engines for large-scale data processing, provides a resilient distributed dataset abstraction for data storage that eliminates the need for intermediate results to be written to distributed file systems, and hence improves real-time data processing. [5] presents dimensionality reduction techniques and their suitability for various kinds of data and application areas. The paper explains Linear Dimensionality Reduction Techniques (LDRTs) and Non-Linear Dimensionality Reduction Techniques (NLDRTs), each of which includes several supervised, unsupervised and semi-supervised methods for achieving low dimensionality. A comparative study between LDRTs and NLDRTs concludes that LDRTs require less computational power and cost. [6] Myat Cho Mon Oo and Thandar Thein proposed an efficient predictive analytics system for high-dimensional big data by optimizing a scalable random forest algorithm on the Apache Spark platform. The effectiveness of the proposed system was examined on five real-world datasets, and the results showed that it achieves highly competitive performance compared with the RF algorithm implemented in Spark MLlib. [7] In this paper, a method for integrating dark data is proposed, and non-negative matrix factorization (NMF) is applied to a clustering ensemble model based on dark data. First, different base clustering results are obtained using various clustering schemes, and NMF is then applied to obtain the integrated result. Experimental results show that the method outperforms other clustering ensemble techniques. [8] This study presents a comprehensive analysis of the DBSCAN algorithm and an improved version of DBSCAN, together with its implementation in MATLAB.
In this process, the data is first separated cleanly, and a new sampling strategy is then applied to reduce the density of dense data and obtain data with a single density distribution (sparse data); the sampling results proved effective. Three strategies were used to obtain the averaged output. [9] Li Meng'ao, Meng Dongxue, Gu Songyuan and Liu Shufen proposed a grid-cell-based approach to overcome the time cost of the algorithm. It has been verified experimentally that DBSCAN based on grid cells shows higher accuracy and lower time complexity. [10] In this paper the authors propose a technique that reduces the time complexity of DBSCAN, compared with the original algorithm, to O(N). The algorithm uses LSH, searching for approximate nearest-neighbor points rather than exact nearest-neighbor points; LSH returns some of the points in the sphere whose center is p and whose radius is (1 + ε)d, where ε is assigned by the user and ε > 0. [11] This paper focuses on a new method to solve dimensionality problems, in which clustering is integrated with a correlation measure to produce a good feature subset; the computational time and accuracy of the proposed method are evaluated against the Relief and IG methods. Relief (Kira and Rendell, 1992) is a feature selection approach based on an instance-based attribute ranking scheme, called the RELIEF algorithm, that deals with incomplete, noisy and multiclass datasets. [12] This paper describes a new similarity measure that accounts for redundancy among data points scattered in the same direction from a given point. The run-time evaluation shows that both the distance-based and shared-neighbor-based density estimates have the same quadratic time complexity with respect to dataset size, and the analysis of noise effects showed that both DBSCAN and CI-DBSCAN have the same robustness to noise. [13] This work combines density-peaks clustering with a gravitational search method to enhance data clustering and proposes a modified mechanism for selecting the cut-off distance.

  • [2] The introduced method selects optimized initial cluster centers, and the initial population is then generated based on the set of candidate clusters. It provides an improved mechanism for parameter selection that increases clustering performance. [14] This paper describes a new DBSCAN variant for estimating the number of clusters through a probabilistic process, namely DBSCAN-Martingale, which introduces randomness into the selection of the density parameters and reduces the number of iterations needed to extract all clusters. The work focuses primarily on an analytical formula for the expected number of clusters extracted per DBSCAN realization, framed in terms of the time operator and generation in density-based clustering. [15] This work describes another enhanced DBSCAN-Martingale probabilistic algorithm for estimating the number of clusters, which likewise introduces randomness into the selection of the density parameters and reduces the number of iterations needed to extract all clusters. The experiments in this work show that, in the overlapping cluster editing problem, the number of inter-cluster edges can be reduced by overlapping two or more clusters; it may then be better to insert a vertex into more than one cluster than to remove edges from that vertex.

Clustering algorithms are attractive for the task of class identification in spatial databases. However, the well-known algorithms suffer from severe drawbacks when applied to large spatial databases, chief among them the increasing time complexity and the difficulty of finding clusters in high-dimensional data.

3 Design

The idea aims at finding a better and more suitable clustering technique based on DBSCAN. We propose a method to reduce high-dimensional data to low-dimensional data, thereby reducing the time complexity. The aim of this work is to design an algorithm that increases the efficiency of finding clusters in high-dimensional datasets using the DBSCAN approach while also reducing the time complexity. This section defines the modules used in this approach: high-dimensional data is handled using the PCA technique, which projects the high-dimensional data onto a lower-dimensional subspace, and the data pre-processing step involves outlier detection, handling missing attribute values, redundant data and similar-density data. The initial step in our work is data pre-processing. Pre-processing is an integral step in machine learning, as the quality of the data and the useful information that can be derived from it directly affect the ability of the model to learn; it is therefore extremely important that we pre-process the data before feeding it into the model. We standardize the dataset so that the attributes are on a comparable scale, for example by converting strings to integers (Fig. 1).

Fig. 1. Proposed framework for the efficient clustering
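For illustration, the following minimal Python sketch shows how the pre-processing and PCA projection described above could look in practice. The input file name, the label encoding of string attributes and the choice of two retained components are our own assumptions and are not prescribed by this work.

```python
# Minimal sketch of the pre-processing and PCA steps (illustrative only).
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA

df = pd.read_csv("data.csv")            # hypothetical input file
df = df.dropna().drop_duplicates()      # handle missing values and redundant data

# Convert string attributes to integers, as described above
for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col])

# Standardize so that every attribute contributes on a comparable scale
X = StandardScaler().fit_transform(df.values)

# Project the high-dimensional data onto a lower-dimensional subspace
X_low = PCA(n_components=2).fit_transform(X)
print(X_low.shape)
```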

Unsupervised learning is the machine learning technique employed in our approach. It has various applications, such as segmenting datasets by shared attributes, detecting anomalies that do not fit any group, and simplifying datasets by aggregating variables with similar attributes. The objective of clustering is to find different groups within the elements of the data. To do so, clustering algorithms find structure in the data such that elements of the same cluster (or group) are more similar to each other than to those of different clusters. After data pre-processing, the data is split into two sets: training data and testing data. DBSCAN is based on counting the number of points within a specified radius ε, and a label is assigned to each data point. The labels are assigned as follows (a minimal labeling sketch is given after the list):

  • A core point is a point that has at least a specified number (MinPts) of neighbouring points within its ε radius.

  • A border point falls within the ε radius of a core point but has fewer than MinPts neighbours of its own.

  • Every other point is a noise point.
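The minimal sketch below labels points as core, border or noise exactly as described in the list above; the brute-force neighbourhood search and the function name are illustrative assumptions, not part of the proposed algorithm.

```python
# Illustrative core/border/noise labelling (brute force, for clarity only).
import numpy as np

def label_points(X, eps, min_pts):
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbours = [np.where(dists[i] <= eps)[0] for i in range(n)]

    labels = np.full(n, "noise", dtype=object)
    core = {i for i in range(n) if len(neighbours[i]) >= min_pts}
    for i in range(n):
        if i in core:
            labels[i] = "core"
        elif any(j in core for j in neighbours[i]):
            labels[i] = "border"          # within eps of a core point
    return labels

# Example usage on a tiny 2-D dataset
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
print(label_points(X, eps=0.5, min_pts=3))   # core, core, core, noise
```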

4 Implementation and Results

In this work we use a novel methodology known as SuperCube-based Accelerated Density-Based Spatial Clustering of Applications (SuperDBSCAN). The algorithm executes with a time complexity of O(n log n), even for higher-dimensional data. It superimposes a virtual cube grid on the given data set and uses the idea of representative points to reduce the number of comparisons. It provides a significant run-time speed-up compared with other proposed improvements and eliminates the need for the MinPts input parameter, thus making it easier to use for naive users.

A series of steps is followed to obtain accurate results: the process begins with data pre-processing, followed by the merging condition, superimposing the supercube grid and choosing the representative points.

Pre-processing

We perform a pre-processing step on the input data and sort the data set dimension by dimension. For example, a two-dimensional data set is first sorted according to the X coordinates and then the result is sorted according to the Y coordinates; this approach to sorting extends to any number of dimensions. The pre-processing speeds up the supercube allocation explained below.
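A minimal sketch of this dimension-by-dimension sort, assuming the points are held in a NumPy array; stable sorts are used so that repeating the pattern for further dimensions behaves as described above.

```python
# Illustrative dimension-by-dimension sort of a 2-D point set.
import numpy as np

points = np.array([[3.0, 1.0], [1.0, 2.0], [3.0, 0.5], [1.0, 0.2]])

# Sort by X first, then stably re-sort the result by Y, as described above;
# the same pattern extends to any number of dimensions.
order = np.argsort(points[:, 0], kind="stable")
points = points[order]
order = np.argsort(points[:, 1], kind="stable")
points = points[order]
print(points)
```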

The Merging Condition

As explained below, the algorithm checks whether the distance between two points is less than the input parameter ε. If it is, the two supercubes to which these points belong are merged so that they belong to the same cluster.
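A hedged sketch of the merging condition: if the closest representative points of two boxes are within ε, the boxes are relabelled to the same cluster. The dictionary of cluster IDs and the helper names are illustrative assumptions.

```python
# Illustrative merging condition between two boxes (supercubes).
import numpy as np

def should_merge(p, q, eps):
    """True if two representative points are within eps of each other."""
    return np.linalg.norm(np.asarray(p) - np.asarray(q)) < eps

def merge_boxes(cluster_id, box_a, box_b):
    """Relabel box_b's cluster to box_a's cluster (illustrative union step)."""
    target, source = cluster_id[box_a], cluster_id[box_b]
    for box, cid in cluster_id.items():
        if cid == source:
            cluster_id[box] = target

# Example: the two closest representative points are about 0.32 apart
cluster_id = {(0, 0): 1, (0, 1): 2}
if should_merge((0.9, 0.9), (1.0, 1.2), eps=0.5):
    merge_boxes(cluster_id, (0, 0), (0, 1))
print(cluster_id)                       # {(0, 0): 1, (0, 1): 1}
```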

Superimposing Super Cubes

One of the key ideas of the proposed algorithm is the formation of the Supercubes. These Supercubes are instrumental in providing an execution speed improvement over the original DBSCAN algorithm. A supercube is an n-dimensional analogue of a square (n = 2) and a cube (n = 3). A Supercube has all the properties that a cube has in three-dimensional space, but in n-dimensional space: it is a closed, compact, convex figure whose 1-skeleton consists of groups of opposite parallel line segments aligned in each of the space's dimensions, perpendicular to each other and of the same length. So a Supercube is a line segment in 1-D, a square in 2-D, a cube in 3-D, a tesseract in 4-D and so on. Supercubes for the first four dimensions are shown in Fig. 3. Based on the dimensionality of the data, we overlay a virtual Supercube on the points such that the length of the space diagonal of the Supercube is ε.

We first decide the area in space where the 2-D grid needs to be constructed. For this, we define the boundary by taking the minimum and maximum of the X and Y coordinates, respectively. Once the scope of the grid is defined, we construct the boxes and superimpose this grid on the original data set. The key idea in the construction of this grid is to build it in such a way that every point in a particular box is guaranteed to belong to the same cluster. This provides a significant speed-up: instead of checking every point against every other point in the data set as in the original DBSCAN algorithm, we only need to check whether any one point in a box satisfies the merging condition; if it does, all the points belonging to that box are guaranteed to satisfy the merging condition too. This property is guaranteed by the way the boxes are created. We choose the diagonal of each box to be equal to the parameter ε that is input to the algorithm, and thus create boxes of length = breadth = ε/√2. Therefore, the maximum distance between any two points inside a box is never greater than ε, and consequently they belong to the same cluster. This approach scales with the number of dimensions: for a two-dimensional data set we construct a grid consisting of square boxes, for three-dimensional data we create a grid of cubes whose space diagonal equals ε, and so on.
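As a sketch of the grid construction, the code below assigns every point to a box whose side length is chosen so that the box's space diagonal equals ε (ε/√2 in 2-D, and ε/√d in d dimensions by the same reasoning); the origin shift and the dictionary of boxes are illustrative choices.

```python
# Illustrative grid construction: every point is placed in a virtual box whose
# space diagonal is eps, so all points sharing a box must share a cluster.
import numpy as np
from collections import defaultdict

def build_grid(X, eps):
    X = np.asarray(X, dtype=float)
    d = X.shape[1]
    side = eps / np.sqrt(d)              # eps / sqrt(2) in the 2-D case
    shifted = X - X.min(axis=0)          # origin shift so indices start at zero
    boxes = defaultdict(list)
    for i, p in enumerate(shifted):
        idx = tuple(np.floor(p / side).astype(int))   # one band per dimension
        boxes[idx].append(i)             # intersection of bands = the box
    return boxes

X = [[0.1, 0.1], [0.2, 0.15], [3.0, 3.0]]
print(dict(build_grid(X, eps=0.5)))      # {(0, 0): [0, 1], (8, 8): [2]}
```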

Choosing Representative Points

For n-dimensional data, we need 2^(n+1) representative points. For the sake of explanation, consider two-dimensional data: for each box of the grid we define eight representative points. These points are labelled as follows:

  • Top

  • Top Right

  • Right

  • Bottom Right

  • Bottom

  • Bottom Left

  • Left

  • Top Left

These points are the boundary points of the box.

For example, the Top point is the top-most point inside the box. A token-ring approach is used to distribute points to these eight positions, as explained here. To decide which box a point belongs to, we divide each of the point's coordinates by ε/√2 to obtain the corresponding band in which the point lies; the intersection of these bands gives the box in which the point lies. If the grid does not start at the origin, we first perform an origin-shift transformation. If a point lies exactly on the edge of a box, the division is rounded up. When the first point is entered into a new box, all eight positions are initialized to that point. An ideal position for a representative point is defined as the case where the representative point lies on the edge of the box, as shown in Fig. 2. Whenever a new point is encountered, for each position we compute the Euclidean distance from the ideal position to the current representative point and to the point being examined; if the new point is closer to the ideal position, the corresponding representative point is updated with the new point. Multiplicity is allowed, that is, one point can represent multiple positions. This process is repeated for all the points belonging to a particular box.

Fig. 2. The distance between the new point and the ideal position is compared with the distance between the representative point and the ideal position; if the former is smaller, the new point is taken as the new representative point.
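The sketch below illustrates the representative-point update of Fig. 2 for the 2-D case; the concrete ideal positions (the box's edge midpoints and corners) and the data structures used are our own illustrative assumptions.

```python
# Illustrative update of the eight representative points of one 2-D box.
import numpy as np

def ideal_positions(box_min, side):
    """Edge midpoints and corners of the box (Top, Top-Right, ..., Top-Left)."""
    x0, y0 = box_min
    x1, y1 = x0 + side, y0 + side
    xm, ym = x0 + side / 2, y0 + side / 2
    return {"top": (xm, y1), "top_right": (x1, y1), "right": (x1, ym),
            "bottom_right": (x1, y0), "bottom": (xm, y0),
            "bottom_left": (x0, y0), "left": (x0, ym), "top_left": (x0, y1)}

def update_representatives(reps, ideals, new_point):
    """Replace a representative if the new point is closer to its ideal spot."""
    p = np.asarray(new_point)
    for name, ideal in ideals.items():
        ideal = np.asarray(ideal)
        if np.linalg.norm(p - ideal) < np.linalg.norm(np.asarray(reps[name]) - ideal):
            reps[name] = tuple(new_point)    # multiplicity is allowed
    return reps

ideals = ideal_positions((0.0, 0.0), side=1.0)
reps = {name: (0.5, 0.5) for name in ideals}          # first point in the box
reps = update_representatives(reps, ideals, (0.9, 0.95))
print(reps["top_right"])                              # (0.9, 0.95)
```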

Depth First Search

To implement the main clustering, we traverse the entire grid in a depth-first fashion, starting with the box closest to the shifted origin. At any given point of time, we compare the two closest representative points between boxes. For example, in a two-dimensional space, we begin by checking whether the Top representative point of the current box i is within an ε distance of the Bottom representative point of its top neighbour box j; the traversal then proceeds in a clockwise fashion. If the distance between the corresponding points is less than ε, the merging condition is satisfied and all the points in both boxes are labelled as belonging to the same cluster. If the merging condition fails, the next box in the DFS traversal is checked and the cluster IDs are not updated.
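A compact sketch of the depth-first traversal described above; the pairing of opposite representative points per direction, the iterative stack and the flood-fill style labelling are illustrative simplifications.

```python
# Illustrative depth-first traversal of the 2-D grid of boxes.
# `boxes` maps box indices to point indices, `reps[box]` maps position names
# to representative points, and `cluster_id[box]` receives the cluster label.
import numpy as np

# Each direction pairs a representative of the current box with the opposite
# representative of the neighbouring box (clockwise order: top, right, ...).
DIRECTIONS = {(0, 1): ("top", "bottom"), (1, 0): ("right", "left"),
              (0, -1): ("bottom", "top"), (-1, 0): ("left", "right")}

def dfs_cluster(boxes, reps, eps):
    cluster_id, next_id = {}, 0
    for start in sorted(boxes):            # box nearest the shifted origin first
        if start in cluster_id:
            continue
        cluster_id[start] = next_id
        stack = [start]
        while stack:
            box = stack.pop()
            for step, (mine, theirs) in DIRECTIONS.items():
                nbr = (box[0] + step[0], box[1] + step[1])
                if nbr not in boxes or nbr in cluster_id:
                    continue
                a, b = np.asarray(reps[box][mine]), np.asarray(reps[nbr][theirs])
                if np.linalg.norm(a - b) < eps:    # merging condition
                    cluster_id[nbr] = cluster_id[box]
                    stack.append(nbr)
        next_id += 1
    return cluster_id
```

With `boxes` and `reps` produced as in the earlier sketches, `dfs_cluster(boxes, reps, eps)` returns a cluster ID per box, which is then propagated to every point inside that box.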

Layering

Two points belong to the same cluster if the distance between them is less than ε, but the exact locations of the points inside the boxes are not known. It is therefore entirely possible for points in boxes that are not immediate neighbours to be within ε of each other. That is, if we consider only the neighbouring boxes in our depth-first traversal (the j-layer), we are bound to miss points that are at a distance less than ε, and consequently the accuracy of the algorithm suffers. To overcome this problem, we introduce the concept of layering. If the check for the j-layer box returns a failure, we check the (j + 1)-th layer box in the same direction, subject to the condition that the distance between the representative points is less than ε. We check the (j + 1)-th layer box only for non-diagonal neighbours, which again reduces the number of iterations. This can be done because the grid is designed so that the diagonal of each box is equal to ε; therefore two points in the diagonal direction cannot be at a distance less than ε without lying in consecutive boxes.
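The fragment below, which extends the DFS sketch above, shows the layering fallback for one non-diagonal direction: if the immediate (j-layer) neighbour fails the merging condition, the box one step further in the same direction is checked under the same ε condition. The helper names mirror the earlier sketches and are purely illustrative.

```python
# Illustrative layering check for one non-diagonal direction.
import numpy as np

def merge_with_layering(box, step, names, boxes, reps, eps):
    """Try the j-layer neighbour, then the (j + 1)-layer one, in `step` direction."""
    mine, theirs = names
    for layer in (1, 2):                           # j-layer, then (j + 1)-layer
        nbr = (box[0] + layer * step[0], box[1] + layer * step[1])
        if nbr not in boxes:
            continue
        a, b = np.asarray(reps[box][mine]), np.asarray(reps[nbr][theirs])
        if np.linalg.norm(a - b) < eps:            # merging condition satisfied
            return nbr                             # this box joins the cluster
    return None                                    # no merge in this direction
```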

Density-based clustering techniques, especially DBSCAN, have found innumerable applications across domains. The traditional DBSCAN algorithm has a time complexity of O(n²), which reduces to O(n log n) with spatial indexing; however, the complexity rises back to O(n²) for high-dimensional data. Here we propose a new algorithm, SuperDBSCAN, to overcome these drawbacks of DBSCAN. It runs with a temporal complexity of O(n log n) and uses a unique combination of a virtual grid imposed on the input data and representative points, which reduces the number of comparisons that need to be made and in turn reduces the time complexity. Our proposed algorithm is a simple but efficient and effective algorithm, and it boosts the performance of DBSCAN for adjacent clusters with different densities.


To check the run time and the efficiency of the proposed SCA-DBSCAN algorithm, we conducted multiple experiments. To verify the results, we ran the DBSCAN algorithm and the FastDBSCAN algorithm on the same data sets with the same input parameters and compared both the accuracy of the results and the time taken by each algorithm to achieve them. We ran the algorithms on data sets representing three different types of data: sparsely distributed data (synthetic data set 1), tightly coupled data, and data of arbitrary shape that classical partition-based algorithms fail to identify (synthetic data set 2). After running the proposed algorithm on these data sets, we observe the following results.

For data set 1, on comparing both algorithms we find that the original DBSCAN forms 5 clusters with an accuracy of 0.166, while the proposed algorithm gives an accuracy of 0.85714 with a run time of 0.722 s and 50 points obtained in each cluster; this data set is tightly coupled, as shown in Fig. 3.

Fig. 3. Output for dataset 1

For data set 2, the original DBSCAN forms 1 cluster with an accuracy of 0.0667, while the proposed algorithm gives an accuracy of 1.0 with a run time of 0.3517 s and 116 points obtained; this data set is sparsely distributed, as shown in Fig. 4.

Fig. 4. Output for dataset 2

For data set 3, the original DBSCAN forms 0 clusters with an accuracy of 0.0, while the proposed algorithm gives an accuracy of 0.9714 with a run time of 1.062 s and 1 point obtained; this data set is sparsely coupled, as shown in Fig. 5.

Fig. 5. Output for dataset 3

For data set 4, the original DBSCAN forms 3 clusters with an accuracy of 0.156, while the proposed algorithm gives an accuracy of 0.8 with a run time of 0.627 s; clusters 1, 2 and 3 contain 50 points each and cluster 4 contains 42 points. This data set has arbitrary size and shape, as shown in Fig. 6.

Fig. 6. Output for dataset 4

5 Conclusion

We have established an algorithm that addresses the drawbacks of a common machine learning technique, DBSCAN. Through this work, a naive user can readily understand the shortcomings of DBSCAN and how they are overcome by the transparent algorithm we have implemented. We conducted a basic comparison between the SuperCube DBSCAN algorithm and the original DBSCAN, in which the accuracy and the time complexity are clearly reported. The main concept of the DBSCAN algorithm is to locate regions of high density that are separated from one another by regions of low density, so detecting clusters of points can be challenging when dealing with spatial clusters of different density, size and shape, and the task becomes even more complicated when the data contains noise and outliers. DBSCAN also tends to be slow on large datasets, since the computation becomes increasingly time consuming as the datasets grow. Our research shows how DBSCAN can be modified so that it identifies outliers easily through anomaly detection, handles large datasets, reduces the time complexity and reduces the dimensionality. We incorporate dimensionality reduction to increase clustering performance. Our work thus proposes a simple yet improved version of DBSCAN that not only reduces the time complexity but also increases the effectiveness and efficiency of the clustering altogether.