1 Clustering algorithms

Cluster analysis is a data analysis tool that sorts objects into clusters so that the similarity of two objects belonging to the same cluster is maximal, whereas their similarity with objects outside this cluster is minimal [14]. The difference between clustering and classification is that in classification the objects are sorted into classes known in advance, whereas in clustering the classes are not known beforehand but result from the clustering itself. In most cases, the basic data representation for clustering is an n × p data matrix X, whose n rows represent individual objects and whose p columns represent their properties.

Clustering methods can be divided into groups according to various criteria. The most common division is by the resulting cluster structure, namely into hierarchical and non-hierarchical methods. Hierarchical methods produce clusters organized in a hierarchical structure represented by a tree, whose individual nodes correspond to clusters. Hierarchical methods can further be divided by the way clusters are built: agglomerative and divisive. The agglomerative approach starts from individual objects, which gradually merge into clusters until all objects form a single cluster. The divisive approach takes the whole set as one unit and builds a hierarchical system of subsets by gradually dividing the objects. Apart from this basic division, clustering methods fall into several further categories according to their clustering approach. The assignment of individual methods to these categories is not completely strict; it often varies between authors, and some methods can be placed in more than one category. The most frequent classification is as follows [11]:

  • Partitioning methods Construct various partitions and then evaluate them by some criterion.

  • Hierarchical methods Create a hierarchical decomposition of the set of data using some criterion.

  • Density-based methods They are based on connectivity and density functions.

  • Grid-based methods They are based on a multiple-level granularity structure.

  • Model-based methods A model is hypothesized for each of the clusters and the idea is to find the best fit of the data to that model.

  • Others.

Table 1 presents an overview of the most frequent clustering algorithms. The methods we deal with further are shown in bold; they were selected so that they belong to different categories of clustering methods.

Table 1 Overview of the selected most frequent clustering algorithms [19, 28]

1.1 DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a data clustering algorithm proposed in 1996 [10]. It is a typical representative of density-based clustering. Density-based clustering starts from the definition of a cluster as an area with a higher spatial density of points compared with other areas. The neighborhood of an object inside a cluster must contain at least a minimum number of other objects (MinPts) within the distance given by the radius ε. The advantages of density-based methods include the recognition of clusters of arbitrary shape, the detection of noise, and robustness to outlying values.

The DBSCAN algorithm works in the following steps:

  1. Select a point from the observed set.

  2. Find all points in the neighborhood of the selected point. If at least MinPts points lie within the distance ε of the selected point, it is a core point and a new cluster is created around it.

  3. Find all objects directly reachable from the core point and add them to the cluster; clusters can be merged in this step.

  4. Finish if no other object can be added to any cluster, otherwise continue with step 1.

Both input parameters, ε and MinPts, are crucial for cluster creation, as their selection significantly affects the clustering result.
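As an illustration of these two parameters, the following minimal sketch applies the DBSCAN implementation from scikit-learn to synthetic data; this is an assumption for illustration only, not the implementation used in our experiments. The arguments eps and min_samples correspond to ε and MinPts.

```python
# Minimal DBSCAN sketch (illustrative only); eps and min_samples correspond
# to the radius eps and MinPts described above.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus a few scattered noise points (synthetic example data).
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(100, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.3, size=(100, 2)),
    rng.uniform(low=-2.0, high=7.0, size=(10, 2)),
])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
# The label -1 marks noise points; the remaining labels enumerate the clusters.
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points:", int(np.sum(labels == -1)))
```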

1.2 K-means

K-means is a non-hierarchical clustering algorithm proposed in 1967 [22]. The algorithm assumes that the clustered objects can be understood as points in a Euclidean space and that the number of clusters k is given in advance. The number of clusters determines the random selection of the initial cluster centers (centroids), which are points in the same space as the clustered objects. Each object is assigned to the cluster whose centroid is closest to it. In the following step, each centroid is recomputed as the mean of the objects belonging to its cluster. The algorithm repeats until the positions of the centroids stabilize.

The K-means algorithm works in the following steps:

  1. Choose the number of clusters k and generate or set the cluster centroids.

  2. Assign each object to the cluster whose centroid has the smallest Euclidean distance to it. If the selected cluster differs from the object's current cluster, move the object there and recalculate the affected centroids.

  3. Finish if no cluster assignment has changed, otherwise continue with step 2.

Limitations of the K-means method are that it can work only with metric data and that several different solutions may exist depending on the initial layout of the objects.
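The iteration described above can be sketched in a few lines of NumPy. The function below is an illustrative implementation with hypothetical names and defaults, not the code used in the experiments.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means sketch following the steps listed above (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k objects at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(n_iter):
        # Step 2: assign each object to the cluster with the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 3: stop as soon as no object changes its cluster.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Recompute each centroid as the mean of the objects assigned to it.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```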

1.3 CLARA

The CLARA (Clustering LARge Applications) method [17] is based on sampling and on the PAM (Partitioning Around Medoids) method, which is one of numerous modifications of the K-means method. Unlike K-means, PAM replaces the centroid with a medoid, i.e., an actual object of the cluster whose total dissimilarity to the other objects of the cluster is minimal. Another improvement is the possibility of using arbitrary metrics.

The CLARA method is designed to process a larger amount of data than PAM. When clustering with CLARA, a random sample of the processed data is selected and clustered into k clusters; the remaining objects are then assigned to those clusters. Sampling is based on a random choice of a small number of objects, to which the PAM method is applied. Objects that are not part of the sample are then assigned to the resulting clusters. This is repeated several times and the best result is selected.

Both input parameters, the number of samples and the sample size, are crucial for clustering, as their selection influences the clustering result.
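The sampling idea of CLARA can be sketched as follows. The medoid search inside each sample is a simple greedy random-swap stand-in for the full PAM procedure, and all names and default values are illustrative assumptions.

```python
import numpy as np

def simple_pam(sample, k, rng, n_swaps=50):
    """Tiny PAM-like medoid search on a sample (a stand-in for full PAM)."""
    medoids = sample[rng.choice(len(sample), size=k, replace=False)]

    def cost(meds):
        d = np.linalg.norm(sample[:, None, :] - meds[None, :, :], axis=2)
        return d.min(axis=1).sum()

    best = cost(medoids)
    for _ in range(n_swaps):
        trial = medoids.copy()
        trial[rng.integers(k)] = sample[rng.integers(len(sample))]  # random swap
        c = cost(trial)
        if c < best:
            medoids, best = trial, c
    return medoids

def clara(X, k, n_samples=5, sample_size=40, seed=0):
    """CLARA-style sketch: cluster several random samples and keep the
    medoid set with the lowest total assignment cost over all of X."""
    rng = np.random.default_rng(seed)
    best_medoids, best_cost = None, np.inf
    for _ in range(n_samples):
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        medoids = simple_pam(X[idx], k, rng)
        # Assign every object in X to its nearest medoid and measure the cost.
        d = np.linalg.norm(X[:, None, :] - medoids[None, :, :], axis=2)
        total = d.min(axis=1).sum()
        if total < best_cost:
            best_medoids, best_cost = medoids, total
    labels = np.linalg.norm(X[:, None, :] - best_medoids[None, :, :],
                            axis=2).argmin(axis=1)
    return labels, best_medoids
```

The two input parameters discussed above appear here as n_samples and sample_size.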

1.4 CURE

The essence of the CURE (Clustering Using REpresentatives) algorithm is that each cluster is described by a fixed number of representative points. First, a random selection of objects is performed and the selected objects are divided into fractions. Each fraction is subjected to a hierarchical cluster analysis. Outliers are then identified, and the required number of final clusters is created from the resulting auxiliary clusters [33]. CURE is an effective algorithm for large datasets.


Both input parameters, the shrinking factor α ∈ 〈0,1〉 and the number of repetitions, are crucial for the creation of clusters, as their selection affects the clustering result.
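A simplified, single-pass sketch of this idea is given below. It omits CURE's division of the sample into fractions and its iterative selection of well-scattered points (here the representatives are simply the points farthest from the centroid), so it illustrates the shrinking of representatives by α rather than reproducing the full algorithm; the parameter names are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cure_sketch(X, k, sample_size=200, n_rep=5, alpha=0.3, seed=0):
    """Simplified CURE-style sketch: hierarchical clustering of a sample,
    shrunken representative points, assignment of all objects to clusters."""
    rng = np.random.default_rng(seed)
    sample = X[rng.choice(len(X), size=min(sample_size, len(X)), replace=False)]

    # Hierarchical cluster analysis of the sample into k auxiliary clusters.
    Z = linkage(sample, method="average")
    sample_labels = fcluster(Z, t=k, criterion="maxclust")

    reps, rep_labels = [], []
    for c in range(1, k + 1):
        members = sample[sample_labels == c]
        centroid = members.mean(axis=0)
        # Take up to n_rep points farthest from the centroid as representatives.
        order = np.argsort(-np.linalg.norm(members - centroid, axis=1))
        chosen = members[order[:n_rep]]
        # Shrink the representatives toward the centroid by the factor alpha.
        shrunk = chosen + alpha * (centroid - chosen)
        reps.append(shrunk)
        rep_labels.extend([c - 1] * len(shrunk))

    reps, rep_labels = np.vstack(reps), np.array(rep_labels)
    # Assign every object to the cluster of its nearest representative point.
    d = np.linalg.norm(X[:, None, :] - reps[None, :, :], axis=2)
    return rep_labels[d.argmin(axis=1)]
```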

2 Self-organizing maps

Self-Organizing Maps (SOM) belong to the group of self-organizing neural networks that learn without a teacher (unsupervised learning) [18]. A SOM is a two-layer network. The number of neurons in the input layer is given by the dimension of the input data. The second, output layer is organized into a certain topological structure, most commonly a two-dimensional grid with neurons laid out in a square or hexagonal arrangement. The learning algorithm is based on the strategy of competitive learning. During learning, the individual training patterns are presented to the network one by one. After a training pattern is presented, the neurons compete, and the winner is the neuron whose Euclidean distance from the presented pattern is the smallest. The weights of the winning neuron are then adjusted to move as close as possible to the pattern. The degree of the weight adjustment is given by the learning parameter μ ∈ (0,1〉. The weights of the other neurons are adjusted as well, scaled by their membership in the neighborhood of the winning neuron, which is determined by the radius ρ.
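A minimal NumPy sketch of this competitive learning loop is given below. The linear decay of the learning parameter and the Gaussian neighborhood function are simplifying assumptions, not the exact training schedule used later in the experiments.

```python
import numpy as np

def train_som(data, grid=(10, 10), n_epochs=20, mu=0.5, rho=3.0, seed=0):
    """Minimal SOM training sketch following the description above."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    dim = data.shape[1]
    weights = rng.random((rows, cols, dim))          # randomly generated weights
    # Grid coordinates of the output neurons, used for neighborhood distances.
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                  indexing="ij"), axis=-1)

    for epoch in range(n_epochs):
        # Linearly decay the learning parameter mu and the radius rho.
        lr = mu * (1.0 - epoch / n_epochs)
        radius = max(rho * (1.0 - epoch / n_epochs), 1.0)
        for x in rng.permutation(data):
            # Competition: the winner is the neuron whose weight vector has
            # the smallest Euclidean distance to the presented pattern.
            dists = np.linalg.norm(weights - x, axis=2)
            win = np.unravel_index(dists.argmin(), dists.shape)
            # Cooperation: neurons within the radius of the winner move toward
            # the pattern, scaled by a Gaussian of their grid distance.
            grid_dist = np.linalg.norm(coords - np.array(win), axis=2)
            influence = np.exp(-(grid_dist ** 2) / (2.0 * radius ** 2))
            influence[grid_dist > radius] = 0.0
            weights += lr * influence[..., None] * (x - weights)
    return weights
```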


2.1 Ways of representing SOM results

  • U-Matrix is created from a trained self-organizing map and shows the Euclidean distance of each neuron to its directly neighboring neurons, either using coloring in a 2D map or as a 3D map. Using the U-Matrix, we are able to divide a self-organizing map into areas corresponding to related input data and to delimit their boundaries clearly (a computation sketch is given after this list).

  • Activation map depicts the frequency of neuron activation on the output map. It is usually rendered in grayscale, where the most frequently activated neurons are depicted in black and the least activated in white.
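The U-Matrix computation referred to above can be sketched as follows; this version assumes a rectangular grid with 4-connected neighbors, whereas a hexagonal topology would average over up to six neighbors instead.

```python
import numpy as np

def u_matrix(weights):
    """U-matrix sketch: for each neuron, the mean Euclidean distance between
    its weight vector and the weight vectors of its direct grid neighbors."""
    rows, cols, _ = weights.shape
    umat = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            dists = []
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < rows and 0 <= nj < cols:
                    dists.append(np.linalg.norm(weights[i, j] - weights[ni, nj]))
            umat[i, j] = np.mean(dists)
    return umat
```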

3 Literature overview

This literature review follows the focus of this article and deals with hybrid approaches to clustering that use the SOM method.

In [34], the authors proposed a new approach to clustering and visualization of students' cognitive structural models. They used SOM in combination with Ward's clustering to conduct the cluster analysis. The authors of [16] presented a new algorithm in which SOM is used to obtain rough clusters and cluster centers, while the K-means method finishes the clustering. Work [32] deals with the classification of objects moving in a video into three classes: pedestrians, bicycles, and vehicles. The created feature vectors are passed to a SOM, whose outputs are three cluster centers that initialize the cluster centers of K-means; the K-means method then completes the clustering. The authors in [31] presented an adaptive approach which uses the Kohonen Self-Organizing Map, extended with online K-means clustering, for the classification of real-time sensor data for further processing. Article [9] presented a new clustering algorithm, SOM++, which combines the K-means and SOM methods. K-means is used there to initialize the weight values on the neuron synapses of the SOM. The authors stated that the proposed SOM++ approach gave better results than SOM in all criteria. The authors in [1] proposed and experimentally verified a new hybrid algorithm for image segmentation based on SOM and Extended Fuzzy C-Means (EFCM), named self-organizing-map based extended fuzzy c-means (SEEFC). The authors in [12] proposed a new hybrid approach based on a combination of a self-organizing map and hierarchical clustering, with which they discovered genes displaying expression regulation during characteristic stages of M. truncatula flower and pod development. Work [6] presented a new hybrid method combining SOM and DBSCAN, called SOM-DBSCAN, for image segmentation. Work [27] dealt with a breast-cancer data analysis for survivability studies and prediction. The authors proposed a hybrid approach based on SOM, DBSCAN, and a multilayer perceptron (MLP). SOM grouped patients with similar characteristics, DBSCAN then created 9 clusters in which the patients have different survivability times, and finally the survivability prediction accuracy for each group was improved by using the MLP.

Most works in the area of clustering algorithms deal with the experimental comparison of individual clustering methods on various datasets. Works in which the authors develop hybrid algorithms based on a combination of various clustering methods are rare. We have not found any works that present a hybrid combination of individual clustering algorithms with SOM and that are, at the same time, evaluated on benchmark datasets.

Therefore, it is this issue that is dealt with in this article. We focused on the possibility of combining the SOM method with other clustering methods—CLARA, CURE, and K-means, which we consider the main contribution of this article.

4 SOM-based clustering methodology

This work presents our proposed methodology based on SOM for analyzing the number of clusters in datasets. Figure 1 depicts how it works. The input data are preprocessed (normalized) in order to create a training set for the SOM. Next, a suitable network topology (the dimensions of the 2D output layer) is proposed, the learning parameter α and the neighborhood radius ρ of the winning neuron are initialized, and the values of the network weights wij are generated randomly. SOM adaptation follows. The results of the learning process are the neuron coordinates and their distances to the neighboring neurons. Before these results enter the selected clustering algorithm, we eliminate neurons that have no assigned object. The parameters of the chosen clustering algorithm are then initialized and the clustering itself is performed, i.e., clusters are created according to the selected algorithm. The algorithm thus works with preprocessed data, and its performance is more effective compared with its outputs on unprocessed data, as the experimental study presented below shows.

Fig. 1 Proposed methodology

The SOM method is primarily useful in the first phases of the process, when the knowledge of the data is still too vague. The advantage of SOM as a tool for reducing and clustering data lies in its ability to structure the data topologically according to their mutual relations and to project them onto a 2D map.
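A compact sketch of the whole pipeline for the SOM + K-means combination is given below (CLARA or CURE would be substituted at the same place). It reuses the train_som sketch shown earlier; the MinMax normalization and the scikit-learn K-means are illustrative assumptions rather than the exact implementation used in the experiments.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

def som_then_kmeans(data, weights, k):
    """Pipeline sketch: map objects to their best-matching neurons, drop the
    neurons with no assigned object, cluster the remaining neuron weight
    vectors with K-means, and propagate the labels back to the objects."""
    rows, cols, dim = weights.shape
    flat = weights.reshape(-1, dim)

    # Best-matching neuron (smallest Euclidean distance) for every object.
    bmu = np.linalg.norm(data[:, None, :] - flat[None, :, :], axis=2).argmin(axis=1)

    # Keep only the neurons that have at least one assigned object.
    used = np.unique(bmu)
    neuron_labels = KMeans(n_clusters=k, n_init=10).fit_predict(flat[used])

    # Propagate the neuron cluster labels back to the original objects.
    neuron_to_cluster = dict(zip(used, neuron_labels))
    return np.array([neuron_to_cluster[n] for n in bmu])

# Usage sketch (train_som is the SOM training sketch shown earlier):
# X = MinMaxScaler().fit_transform(raw_data)   # preprocessing / normalization
# W = train_som(X, grid=(10, 10))
# labels = som_then_kmeans(X, W, k=2)
```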

5 Experiments

The comparison of the individual clustering methods used the benchmark dataset Fundamental Clustering Problems Suite (FCPS) [29]. The FCPS can be downloaded from the following website: https://www.uni-marburg.de/fb12/arbeitsgruppen/datenbionik/data. This repository contains a set of benchmark problems that test the limits of clustering algorithms. A description of the used datasets is given in Table 2, which provides the following parameters for each tested set: the dimension, the number of clusters, the number of objects, and the clustering problem it represents.

Table 2 Used selected test problems from the FCPS dataset [29]

5.1 Metrics

We used the Rand index metric, which expresses the level of similarity between two data clusterings and which is often used to compare the performance of various clustering algorithms (Table 3).

Table 3 Optimal parameters setting and Rand index

Definition [23]: Given a set of n elements \(S = \{o_1, \ldots, o_n\}\) and two partitions of S to compare, \(Y = \{Y_1, \ldots, Y_r\}\), a partition of S into r subsets, and \(Z = \{Z_1, \ldots, Z_s\}\), a partition of S into s subsets, define the following:

  • a, the number of element pairs in S which belong to the same subset in Y and to the same subset in Z

  • b, the number of element pairs in S which belong to different subsets in Y and to different subsets in Z

  • c, the number of element pairs in S which belong to the same subset in Y and to different subsets in Z

  • d, the number of element pairs in S which belong to different subsets in Y and to the same subset in Z

The Rand index R is defined as (1):

$$ R = \frac{a + b}{a + b + c + d} $$
(1)

The Rand index can be perceived as the percentage of correct decisions made by the algorithm [23]. The following formula (2) is used to compute it:

$$ R = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FP} + \text{FN} + \text{TN}}, $$
(2)

where TP represents the number of true positives, TN represents the number of true negatives, FP represents the number of false positives, and FN represents the number of false negatives.
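A direct pair-counting sketch of Eq. (1), written as a hypothetical helper over two label vectors, could look as follows; for large datasets one would count the agreements from a contingency table instead of enumerating all pairs.

```python
from itertools import combinations

def rand_index(y, z):
    """Rand index sketch following Eq. (1): the fraction of object pairs on
    which partitions y and z agree (both same-subset or both different)."""
    pairs = list(combinations(range(len(y)), 2))
    agree = 0
    for i, j in pairs:
        same_y = y[i] == y[j]
        same_z = z[i] == z[j]
        if same_y == same_z:      # counted in a (same/same) or b (diff/diff)
            agree += 1
    return agree / len(pairs)

# Example: two partitions of five objects agreeing on 8 of the 10 pairs.
print(rand_index([0, 0, 1, 1, 2], [0, 0, 1, 2, 2]))  # 0.8
```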

5.2 Experimental validation

Each method was tested on the selected data repeatedly: 100 runs were performed for the methods that can give a different result depending on the initial division of the objects. Methods that are not sensitive to the arrangement of the objects were also run several times in order to find suitable parameter settings. For each method, we report the rate of correctly placing objects into clusters with respect to the pattern classes, using the Rand index, Eq. (2).

Table 4 provides an overview of the results for the individual methods with their optimal parameter settings, with the best achieved result for a given tested set in bold. The optimal parameter settings for each method are shown in Table 3. The most successful method on this dataset was DBSCAN, which achieved 100% success in five sets. The CURE method achieved 100% success in three datasets and was better than DBSCAN in one case. The CLARA method achieved 100% success in two datasets and was more successful than DBSCAN in one of them. The CLARA and CURE methods were not able to cope with linearly non-separable clusters. Unlike DBSCAN, both methods were able to create better clusters for the Engytime set, CLARA in particular.

Table 4 Experimental outcomes (noise is depicted in green)

Table 6 provides an overview of the results of the SOM method. For each tested set, a U-matrix and an activation map of the output neuron layer are depicted. For better visibility, the created clusters are highlighted with a black line. The ability of the SOM method to determine the number of clusters is clearly visible. The clusters are formed by neurons with the smallest distance to their topological neighbors (depicted in a blue hue in the U-matrix). The activation map shows the number of objects assigned to each neuron. The parameters of the SOM method for the individual tested sets are given in Table 5. For the individual sets, the size of the output layer and the value of the learning parameter (αstart) were determined experimentally based on the complexity of the solved problem.

Table 5 Setting the parameters of the SOM method
Table 6 U-matrix and activation map of the outputs of the SOM method

6 Experimental outcomes and the contribution of the proposed methodology

The outcomes of the SOM method were subsequently processed by the other clustering methods. The outcomes of the SOM method are the neuron coordinates and their distances to the neighboring neurons. In order to achieve better results in further clustering, it was necessary to remove the neurons that have no assigned object, which resulted in better cluster separation. The parameters of the tested methods are provided in Table 7. The K-means method only requires setting the number of clusters; this parameter is set by the user according to the given dataset. The outcomes of the experimental study are presented in Table 8, which states the average values of the Rand index over 100 measurement runs.

Table 7 Setting the parameters of the methods
Table 8 Rand index—summary of the outcomes combined with the SOM method

The DBSCAN method can be considered the most successful of the compared methods (it achieved a Rand index of 1 in most cases); that is why we did not combine it with the SOM method. However, this method is very sensitive to the initialization of its parameters. One possible way to find an initial setting of the parameter ε is to use the SOM method, specifically the U-matrix, where ε can be chosen from the range between the smallest and the largest distance to the neighboring neurons.

All of the used methods achieved better results when combined with the SOM method than when used separately. The combined methods were able to solve problems which would otherwise be unsolvable for them separately, or which they would not solve correctly. The combination of SOM with the other clustering methods led to the most significant improvement in the set called Atom, which no method was able to solve on its own. This set was solved with a 100% success rate by the combinations SOM + CLARA and SOM + K-means, and with 99% by SOM + CURE. However, no combination with SOM led to a solution of the set called Chainlink. This is caused by the fact that its clusters are linearly non-separable, which also remained true in the output SOM layer. For the Engytime set, the success rate was comparable with that achieved by the methods used separately.

7 Comparison with other academic works

In this section, we focus on published academic works in the area of clustering algorithms in which the proposed approaches were experimentally verified on the FCPS datasets [29] and the outcomes were evaluated according to the Rand index, Eq. (2). The chart in Fig. 2 depicts the average values of the Rand index achieved by the methods published by other authors and by the methods proposed in this work. The red color shows the values achieved by our hybrid clustering algorithms, i.e., SOM + K-means, SOM + CURE, and SOM + CLARA. The chart shows that the proposed approaches outperformed, in most cases, the experimental outcomes published in other academic works [2,3,4,5, 7, 8, 13, 15, 20, 21, 24,25,26, 30].

Fig. 2 Average Rand index values. The red color shows the values achieved by the proposed methods, i.e., SOM + K-means, SOM + CURE, and SOM + CLARA (color figure online)

8 Conclusions and future work

Most works in the area of clustering algorithms deal with an experimental comparison of individual clustering methods on various datasets. Works in which the authors also propose a hybrid algorithm stemming from various combinations of clustering algorithms are rare. This article deals with this issue.

The proposed methodology increases the efficiency of clustering methods, which was proved by the performed experimental study; this is the main contribution of our work. We focused on the possibility of combining the SOM (Self-Organizing Maps) method with other clustering methods, namely CLARA, CURE, and K-means. The SOM method is primarily useful in the first phases of the process, when knowledge of the data is still too vague. The selected clustering algorithm is applied afterwards and works with the preprocessed data. Its performance, compared with the outcomes on unprocessed data, is more effective, which was proved by the experimental study performed on the benchmark dataset Fundamental Clustering Problems Suite (FCPS). Part of the experimental verification was also a comparison of the achieved results with other approaches using this dataset, based on a standard metric, the Rand index. The chart in Fig. 2 shows that in most cases our proposed approach outperformed the outcomes published in other academic works.

In our future work, we would like to focus on further improving the efficiency of clustering algorithms, primarily in combination with algorithms from the area of soft computing. Regarding SOM, we would like to propose an algorithm which could estimate a suitable size of the output layer as well as determine its suitable dimension (1D, 2D, or 3D, …).