1 Introduction

High-dimensional data are data with a large number of features. Such data arise in many domains, including recommendation systems, microarray analysis, and social media. Their rapid growth has attracted the attention of researchers for the last two decades (Steinbach et al. 2003; Kriegel et al. 2009). Clustering is the task of finding groups of similar data objects based on their attributes. Effective clustering of high-dimensional datasets is an important and challenging research issue in data mining (Abualigah et al. 2020; Abualigah 2019). Traditional clustering algorithms such as K-Means, DBSCAN, and OPTICS (Fahad et al. 2014) perform clustering in the full-dimensional space: they attempt to find clusters using all attributes of every data object. However, applying these algorithms becomes computationally expensive when the number of attributes/dimensions is large, a problem known as the “curse of dimensionality” (Steinbach et al. 2004). One reason is that distance measures lose their discriminative power because data points are sparse in high-dimensional space. Clusters in such spaces typically lie in a few relevant dimensions; such subsets of features are called subspaces. Irrelevant dimensions and noise mask the true clusters, so classical clustering algorithms fail to determine clusters lying in different subspaces. Subspace clustering is one of the efficient ways of clustering high-dimensional data.

The data mining research community has proposed a number of techniques for clustering high-dimensional data (Assent 2012; Abualigah et al. 2021a, b). To determine clusters lying in different subsets of dimensions, subspace clustering algorithms (Domeniconi et al. 2004; Parsons et al. 2004) are employed. Subspace clustering finds groups of similar objects within relevant subsets of the dataset’s dimensions. Subspace clustering algorithms are broadly classified into two categories: hard subspace and soft subspace clustering (Deng et al. 2016). Hard subspace clustering determines exact subspaces for the clusters; these subspaces may overlap across clusters. Müller et al. (2009a, b) divided hard subspace clustering algorithms into three categories: cell-based, density-based, and clustering-oriented algorithms. Researchers have developed a number of algorithms to obtain better subspace clusters; relevant work is discussed below.

Evolutionary algorithms have been incorporated to form effective subspace clusters (Agarwal and Mehta 2014, 2017; Abualigah et al. 2021a, b). The first study in this direction was by Sarafis et al. (2003), who introduced genetic operators for determining subspaces in a subspace clustering algorithm; experiments were performed on 80-dimensional datasets but were not compared with existing algorithms. Lu et al. (2011) introduced a technique for soft projected clustering of high-dimensional data using particle swarm optimization (PSO); it addresses the problem of variable weighting in projected clustering and is evaluated on datasets of up to 2000 dimensions. Timmerman et al. (2013) presented a new variant of the subspace k-means algorithm for shaping clusters in all dimensions; the algorithm was assessed against k-means, factorial k-means, mixture factor analysis, and reduced k-means using adjusted rand_index and cluster variance on a 9-dimensional dataset only. Lin et al. (2014) suggested an evolutionary approach for determining subspaces; to improve clustering quality, it uses a hybrid genetic algorithm in which the local search is performed by PSO, and its efficacy is assessed in terms of error rate on problems of up to 13 dimensions. Kaur and Datta (2015) presented an extended version of the SUBSCALE algorithm, examined on F_Score for datasets of up to 6144 dimensions, although no result for the 6144-dimensional dataset is reported in the paper. Kumar et al. (2016) proposed a clustering algorithm (clusiVat) to address the issues of big data; it is compared with four existing algorithms on the basis of rand_index and running time for 500-dimensional data only. A survey of nature-inspired algorithms with evolutionary strategies and applications is given in Agarwal and Mehta (2014), a comprehensive analysis of nature-inspired algorithms in Agarwal and Mehta (2017), and a comparative analysis of these algorithms on clustering in Agarwal and Mehta (2015). To improve clustering quality, an enhanced version of the flower pollination algorithm has also been employed (Agarwal and Mehta 2016). Zhong and Pun (2020) gave a subspace algorithm that finds similarities between data points and performs feature selection; it normalizes the column values, does not find overlapping clusters, and was evaluated on at most 6000 dimensions. Yan et al. (2020) proposed a multi-view subspace clustering algorithm that extends K-means; the clusters formed are mostly spherical and include all samples, so maximal subspaces, overlapping clusters, and noisy points cannot be identified. An extended version of high-dimensional data clustering using the multivariate t-distribution is given by Pesevski et al. (2018); the algorithm is evaluated only on low-dimensional datasets (at most 8 dimensions). A novel subspace clustering technique for a large number of samples is given by Liu et al. (2020); although it discovers clusters with low time complexity, overlapping clusters cannot be found. Another low-rank subspace clustering algorithm is given by Zhao et al. (2019); the maximum number of dimensions evaluated is only in the hundreds, and the developed algorithms are not compared with conventional subspace clustering algorithms.

Although various subspace clustering algorithms have been developed, many challenges remain in finding subspace clusters in high-dimensional data. There is scope for improvement in clustering quality and in finding overlapping subspace clusters. Existing algorithms are often unable to find maximal subspaces, i.e., subspaces without redundant information, and to exclude noisy data points from clusters. Existing hard subspace clustering algorithms cannot deal with truly high-dimensional data, i.e., data with thousands of attributes; most existing studies on high-dimensional clustering have been assessed on a few hundred attributes only. Additionally, these algorithms require the data to be normalized to the range 0–1 before clustering; such normalization is usually of the min–max type and does not handle outliers properly. Existing algorithms also have limited capability to find clusters of varying densities: clusters found in subsets of dimensions are generally of the same density, which may cause redundant data points to be included in a cluster or relevant points to be left out.

These limitations motivate the development of a new subspace clustering algorithm. This paper is an extended version of the work presented in Agarwal and Mehta (2019b). In that work, the subspace algorithm was integrated with a differential evolution algorithm; however, it tends to get stuck in local optima because of its limited capability to explore the complete search space, and it fails to maintain a judicious balance between exploration and exploitation. One reason is that the parameters of DBSCAN are not well tuned. These shortcomings are addressed in the present work. A thorough analysis of the results and statistical tests substantiate the performance of S_FAD with respect to subspace_DE (subspace with differential evolution) (Agarwal and Mehta 2019b) and other subspace clustering variants.

In this work, a hybrid meta-heuristic subspace clustering algorithm named S_FAD is proposed. In S_FAD, a self-tuned DBSCAN algorithm is used to perform clustering. Clustering begins on one-dimensional data; once clusters in each dimension are formed, their details are stored in a hash table. S_FAD uses hashing to find maximal subspace clusters, and the DBSCAN algorithm is then executed in each maximal subspace to form the final clusters. It employs a bottom-up subspace search method to determine subspaces. A self-tuned version of the DBSCAN algorithm is introduced in this work, in which the input parameters of DBSCAN are determined automatically for the dataset at hand. This is achieved by a hybrid meta-heuristic algorithm named FAD (Agarwal and Mehta 2019a), a combination of the flower pollination, artificial bee colony, and differential evolution algorithms (named ABC_DE_FP in Agarwal and Mehta 2019a). Experimentally, it is observed that S_FAD can find clusters in datasets of up to 6400 dimensions. It overcomes the shortcomings of the bottom-up approach by automatically determining optimized parameters suitable for a given dataset. It successfully determines overlapping subspace clusters and does not need any parameters a priori, such as the number of clusters or the number of subspaces. S_FAD eliminates duplicate subspaces by determining the maximal subspaces through hashing, and it obtains all possible clusters in which each data point participates. In addition, S_FAD does not normalize the original dataset, as most subspace clustering algorithms do, and it finds arbitrarily shaped clusters of varying densities.

The performance of S_FAD is evaluated on standard artificial and actual datasets and compared with various conventional subspace clustering algorithms. F_Score and rand_index are used as evaluation measures, and on their basis S_FAD is judged using (a) average ranking, (b) success rate ratio ranking, (c) the Wilcoxon signed-rank test, and (d) scalability in terms of dimensions. The results demonstrate that the proposed algorithm handles the challenges of subspace and high-dimensional clustering effectively.

The rest of the paper is organized as follows. Section 2 describes the proposed algorithm in detail along with its pseudocode. The experimental setup is presented in Sect. 3. Section 4 reports the experimental results and analysis, and Sect. 5 concludes the paper.

2 Subspace clustering in high-dimensional data

Clusters in high-dimensional data are mostly present in low-dimensional subspaces, and these subspaces may vary from cluster to cluster. Hence subspace clustering plays an important role in clustering high-dimensional data. Subspace clustering algorithms explore the subspaces existing in a dataset with two different search techniques (Parsons et al. 2004): top-down and bottom-up subspace search. The top-down approach is an iterative method that starts by finding clusters in the full-dimensional space, assuming all dimensions have equal weights. Thereafter, according to the clusters formed, each dimension is allotted a certain weight, and the modified weights are used in the next iteration to generate new clusters. The most crucial input parameter is the size of the subspaces, which is difficult to determine at an early stage, and this approach has the drawback of finding disjoint subspaces of equal size. Some top-down search algorithms (Parsons et al. 2004) are PROCLUS, FINDIT, and ORCLUS. The bottom-up search method starts clustering from single dimensions and follows the Apriori principle to reduce the search space: only those dimensions containing dense units participate in the formation of higher-dimensional subspaces. This approach finds overlapping clusters and subspace clusters of arbitrary shape; its input parameters are the density threshold and the grid size. CLIQUE, DOC, etc., are bottom-up subspace algorithms.
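To illustrate the Apriori principle behind the bottom-up search, the following Python sketch (a generic illustration, not the exact procedure of any cited algorithm) generates (k + 1)-dimensional candidate subspaces from the k-dimensional subspaces found to contain dense units, pruning any candidate whose k-dimensional subsets are not all dense.

```python
from itertools import combinations

def generate_candidates(dense_subspaces):
    """Apriori-style join: a (k+1)-dimensional subspace is a candidate only if
    every one of its k-dimensional sub-subspaces contains dense units."""
    dense = set(map(frozenset, dense_subspaces))
    k = len(next(iter(dense)))
    candidates = set()
    for a in dense:
        for b in dense:
            union = a | b
            if len(union) == k + 1:
                # prune: all k-dimensional subsets of the union must be dense
                if all(frozenset(s) in dense for s in combinations(union, k)):
                    candidates.add(union)
    return candidates

# dense 1-D subspaces {0}, {1}, {2} yield the 2-D candidates {0,1}, {0,2}, {1,2}
print(generate_candidates([{0}, {1}, {2}]))
```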

2.1 Proposed algorithm (S_FAD)

A novel density-based meta-heuristic subspace clustering algorithm is proposed for clustering high-dimensional data. It is named S_FAD (subspace FAD) for ease of reference and comparison. S_FAD is inspired by Kaur and Datta (2015) and uses the bottom-up strategy of subspace clustering (Parsons et al. 2004), which is based on the Apriori principle (Agrawal et al. 1996). Accordingly, data points that do not participate in lower-dimensional clusters are eliminated from higher-dimensional subspace clustering.

As shown in Fig. 1, the algorithm begins by performing clustering on each attribute of the dataset (taking all tuples) using the self-tuned DBSCAN algorithm (explained in the next subsection). Prior to clustering, every data point is assigned a large unique random integer called its signature. In S_FAD, the signature is a randomly generated 15-digit natural number for each record/tuple in the dataset; large numbers are used to avoid identical cluster signatures. The signature of a cluster is the sum of the signatures of the data points belonging to it and acts as the key of the hash table, in which the clustered (dense) data points, their signature sum, and the dimension are recorded. If two entries in the hash table have the same signature, their attributes are combined into a single entry, as shown in Fig. 2 (signatures are used simply to match identical clusters formed in different attributes, so signatures are compared instead of individual data points). This step ensures that the merged attributes belong to the same subspace. Thereafter, rows of the hash table with the same subspaces/dimensions are merged, which yields the relevant maximal subspaces together with their dense points. Finally, the self-tuned DBSCAN algorithm is run within each maximal subspace, producing the final clusters of the high-dimensional dataset.
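The signature-and-hashing step can be sketched as follows (a simplified Python illustration; the helper names and the dictionary layout are ours, and signature collisions are ignored). Each one-dimensional cluster is keyed by the sum of its points' signatures, so clusters containing exactly the same points in different dimensions hash to the same entry, and their dimensions accumulate into one candidate subspace.

```python
import random
from collections import defaultdict

def random_signatures(n_points):
    # one large (15-digit) random integer per record, as described above
    return [random.randint(10**14, 10**15 - 1) for _ in range(n_points)]

def build_hash_table(clusters_per_dim, signatures):
    """clusters_per_dim maps a dimension to the list of point-index sets that
    DBSCAN found in that dimension.  Clusters with the same signature sum
    contain the same points, so their dimensions form one candidate subspace."""
    table = defaultdict(lambda: {"dims": set(), "points": set()})
    for dim, clusters in clusters_per_dim.items():
        for points in clusters:
            key = sum(signatures[p] for p in points)   # cluster signature = hash key
            table[key]["dims"].add(dim)
            table[key]["points"].update(points)
    return table
```

Merging the resulting rows that share the same dimension set then gives the maximal subspaces on which DBSCAN is re-run.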

Fig. 1
figure 1

Clustering in single-dimensional data

Fig. 2
figure 2

S_FAD algorithm

Pseudocode of S_FAD is presented in Algorithm 1. Assume a given dataset X with m tuples and n attributes {d1, d2, d3, …, dn}.

figure a

2.2 Self-tuned DBSCAN using FAD

In the proposed algorithm, the DBSCAN (density-based spatial clustering of applications with noise) algorithm is used to perform clustering in each dimension of the data as well as in the maximal subspaces formed via the hash table. The advantage of DBSCAN over partition-based algorithms (Fahad et al. 2014) is its ability to find clusters of arbitrary shape and to detect noisy points; moreover, it clusters a dataset without any prior knowledge of the number of clusters. DBSCAN takes two input parameters, epsilon (ε) and MinPts (τ): ε is the maximum distance between two points for them to be considered neighbors, and MinPts is the minimum number of points required within the ε-neighborhood of a point to form a cluster. The efficacy of DBSCAN largely depends on these two parameters; they are sensitive, vary from dataset to dataset, and are hard to determine a priori. Therefore, for each dataset, optimized values of these parameters are determined by the proposed meta-heuristic FAD algorithm, and DBSCAN is accordingly referred to here as self-tuned DBSCAN.
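To illustrate how the tuned parameters are consumed, the snippet below runs scikit-learn's DBSCAN with a given (ε, MinPts) pair; in S_FAD these two values would be supplied by the FAD optimizer rather than fixed by hand, and the concrete numbers here are placeholders.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(200, 1)            # e.g. a single attribute of the dataset
eps, min_pts = 0.05, 4                 # placeholders; FAD would supply these values
labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
# label -1 marks noise points; every other label is a cluster id
print(np.unique(labels))
```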

FAD is a swarm intelligence-based algorithm that amalgamates flower pollination (FP) (Yang 2012a, b), the artificial bee colony algorithm (ABC) (Karaboga and Basturk 2007), and differential evolution (DE) (Storn and Price 1997). FAD is the name given to the binary version of the ABC_DE_FP algorithm (Agarwal and Mehta 2019a), which was developed for continuous optimization problems. It was established that ABC_DE_FP performs better than existing meta-heuristic algorithms on complex benchmark functions (Agarwal and Mehta 2019a). Hence, a binary version of the algorithm is developed here to optimize the DBSCAN parameter values, i.e., MinPts and ε.

The FAD algorithm encodes each individual of the population in binary form. Each individual represents MinPts as a bit string (Karami and Johansson 2014); since MinPts is a discrete value, the binary version is applied instead of the continuous one. The number of dimensions ‘D’ equals the number of bits required to represent MinPts. In FAD, the population of individuals is randomly initialized, and each individual represents a food source whose fitness is computed through an objective function. Here, purity (Eq. (1)) is the objective function used to obtain the best values of MinPts and ε. Purity is an external validation criterion for measuring the quality of the clusters formed: the higher the purity, the better the MinPts and ε. To compute purity, the most frequent class in a cluster is assigned to that cluster; the correctness of these assignments is then measured by counting the number of appropriately assigned data points and dividing by the total number of points in the dataset, N.

$$ {\text{purity}} = \frac{{\mathop \sum \nolimits_{{j = 1}}^{k} \max _{{1 \le i \le l}} \left| {{\text{class}}_{i} \cap {\text{cluster}}_{j} } \right|}}{N} $$
(1)

where k is the number of clusters formed, l is the number of predefined classes in the dataset, i indexes the classes, and j indexes the clusters produced by the clustering algorithm. The numerator of Eq. (1) counts, for each cluster j, the data points of its majority class i, so that class i is assigned to cluster j.
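A direct implementation of the purity objective of Eq. (1) might look as follows (a sketch; it assumes integer class labels starting at 0 and skips DBSCAN's noise label −1, a choice the text above does not prescribe).

```python
import numpy as np

def purity(true_classes, cluster_labels):
    """Eq. (1): for every cluster, count the points of its majority class,
    sum these counts over clusters and divide by the total number of points N."""
    true_classes = np.asarray(true_classes)
    cluster_labels = np.asarray(cluster_labels)
    correct = 0
    for c in np.unique(cluster_labels):
        if c == -1:                      # assumption: DBSCAN noise points are skipped
            continue
        members = true_classes[cluster_labels == c]
        correct += np.bincount(members).max()
    return correct / len(true_classes)

# toy example: two clusters, one of which mixes two classes
print(purity([0, 0, 1, 1, 1], [0, 0, 1, 1, 0]))   # -> 0.8
```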

The working of the FAD algorithm is divided into three phases: employed bee, onlooker bee, and scout bee. In the employed bee phase, each food source is updated with the mutation (Eq. 2) and crossover (Eq. 3) strategies of the differential evolution algorithm:

$$ v_{i} = ~x_{a} + F \cdot \left( {x_{b} - x_{c} } \right) $$
(2)

where \(v_{i}\) is the mutant food source, i = 1, 2, …, N, and xa, xb, and xc are distinct target food sources of the same population whose indices a, b, and c are all different from i. F is a scaling factor; in the present algorithm it is drawn uniformly at random from the range [betamin, betamax]. Crossover is then performed between the mutant and the target food source; the newly generated food source is called the trial food source \(u_{{ij}}\).

$$ u_{{ij}} = \left\{ {\begin{array}{*{20}l} {v_{{ij}} ,} \hfill & {{\text{if}}\;{\text{rand}}\left[ {0,1} \right] \le {\text{Cr}}\;{\text{or}}\;j = j_{{{\text{rand}}}} } \hfill \\ {x_{{ij}} ,} \hfill & {{\text{otherwise}}} \hfill \\ \end{array} } \right. $$
(3)

where j indexes the D dimensions and jrand is an index chosen randomly from 1 to D; this ensures that the trial vector uij inherits at least one component from the mutant vector vij. Cr is the crossover rate, which controls whether a component is taken from the target or the mutant vector. The fitness of the trial solution is computed in the form of purity: if the purity (fitness) of the new solution is better, it replaces the old solution in the population; otherwise its trial value is incremented by 1. Trial is a counter maintained for each food source: it is reset to 0 whenever the food source is updated and incremented otherwise. After the employed bee phase, food sources are updated in the onlooker bee phase, in which the choice between global and local search is controlled by the switch probability p. Global search is performed through the global pollination process of the flower pollination algorithm (Eq. 4):

$$ x_{i}^{{t + 1}} = x_{i}^{t} + L\left( {x_{i}^{t} ~ - ~g_{*} } \right) $$
(4)

where \(x_{i}^{t}\) is the ith food source at the tth generation, \(g_{*}\) is the current best food source, and L is a step size drawn from a Lévy flight distribution (Pavlyukevich 2007; Yang 2012a, b). If the algorithm switches to the local search process instead, the DE mutation strategy of Eq. (2) followed by the crossover of Eq. (3) is applied to obtain the new food source. If the fitness of the food source improves, it is updated in the population; otherwise its trial counter is incremented. Food sources with the highest fitness value (nectar amount) are memorized. If the trial counter of any food source exceeds the Limit value, that food source is discarded and a scout bee searches for a new one. The process continues until the termination condition, a maximum number of iterations, is reached.
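The sketch below puts Eqs. (2)–(4) together for one onlooker-bee update: a Lévy-flight move toward the current best for global search (Eq. 4, using Mantegna's algorithm, a common way to draw Lévy steps), and DE mutation plus binomial crossover for local search (Eqs. 2–3). Because FAD works on bit strings, the real-valued vectors are mapped back to bits with a sigmoid transfer function; that binarization is our assumption and the original FAD may use a different scheme.

```python
import numpy as np
from scipy.special import gamma

rng = np.random.default_rng(0)

def levy_step(D, beta=1.5):
    # Mantegna's algorithm for Levy-distributed step sizes
    sigma = (gamma(1 + beta) * np.sin(np.pi * beta / 2) /
             (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma, D)
    v = rng.normal(0.0, 1.0, D)
    return u / np.abs(v) ** (1 / beta)

def binarize(v):
    # sigmoid transfer: map a real-valued vector back to a bit string (assumption)
    return (rng.random(v.shape) < 1.0 / (1.0 + np.exp(-v))).astype(int)

def decode_minpts(bits):
    # interpret the bit string as MinPts (at least 1)
    return max(1, int("".join(map(str, bits)), 2))

def onlooker_update(pop, i, g_best, p=0.8, F=0.8, Cr=0.9):
    """One onlooker-bee update of bit-string individual i."""
    N, D = pop.shape
    if rng.random() < p:                                    # global pollination, Eq. (4)
        return binarize(pop[i] + levy_step(D) * (pop[i] - g_best))
    # local search: DE mutation (Eq. 2) followed by binomial crossover (Eq. 3)
    a, b, c = rng.choice([k for k in range(N) if k != i], 3, replace=False)
    mutant = binarize(pop[a] + F * (pop[b] - pop[c]))
    j_rand = rng.integers(D)
    trial = pop[i].copy()
    for j in range(D):
        if rng.random() <= Cr or j == j_rand:
            trial[j] = mutant[j]
    return trial
```

The returned bit string is then evaluated by decoding it into MinPts, running the self-tuned DBSCAN, and computing the purity of Eq. (1); the trial replaces its parent only if the purity improves, otherwise the parent's trial counter grows toward the Limit.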

To calculate the fitness of a food source, ε is derived from the corresponding MinPts value. The epsilon value (ε) is determined analytically (Daszykowski et al. 2001) from MinPts and the data matrix x using Eq. (5):

$$ {\text{Eps}} = \left( {\frac{{\left( {\mathop \prod \nolimits_{{i = 1}}^{n} \left( {\max \left( {x_{i} } \right) - \min \left( {x_{i} } \right)} \right)} \right)*k*\Gamma \left( {0.5n + 1} \right)}}{{m\sqrt {\pi ^{n} } }}} \right)^{{1/n}} $$
(5)

where m is the number of tuples, n is the number of dimensions of the data matrix x, max(xi) and min(xi) are the maximum and minimum values along the ith dimension, k is MinPts, and \(\Gamma\) is the gamma function, which generalizes the factorial of its argument.
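A direct transcription of Eq. (5) might look as follows (a sketch; it assumes no constant attribute, i.e., every dimension has a positive range, and evaluates the expression in log space because the product of ranges and π^n overflow easily for large n).

```python
import numpy as np
from scipy.special import gammaln

def analytical_eps(X, minpts):
    """Eq. (5): derive epsilon from MinPts (k) and the m x n data matrix X."""
    m, n = X.shape
    ranges = X.max(axis=0) - X.min(axis=0)       # max(x_i) - min(x_i) per dimension
    log_num = np.sum(np.log(ranges)) + np.log(minpts) + gammaln(0.5 * n + 1)
    log_den = np.log(m) + 0.5 * n * np.log(np.pi)
    return float(np.exp((log_num - log_den) / n))

# usage: eps for MinPts = 4 on a random 500 x 20 data matrix
print(analytical_eps(np.random.rand(500, 20), 4))
```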

The FAD algorithm initializes the following input parameters (a possible configuration is sketched after the list):

  • Iterations: the number of times each individual is updated by the algorithm.

  • D: the number of bits used to represent MinPts.

  • N: the population size.

  • betamin, betamax: the range from which the scaling factor F is drawn (see Eq. 2).

  • Trial: a counter incremented when a solution does not improve.

  • Limit: the threshold for invoking scout bees (random re-initialization of an individual).

  • p: the switch probability for selecting between local and global search.
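For concreteness, such a parameter set could be collected as in the dictionary below. The population size and iteration count reflect the tuning reported in Sect. 3.2; the remaining numeric values are illustrative placeholders, not the values of Table 1.

```python
# illustrative FAD configuration (not the tuned values of Table 1)
fad_params = {
    "iterations": 20,   # tuned in Sect. 3.2
    "N": 40,            # population size (10 is used for 500+ dimensional data)
    "D": 8,             # bits used to encode MinPts (placeholder)
    "betamin": 0.2,     # lower bound of the scaling factor F (placeholder)
    "betamax": 0.8,     # upper bound of the scaling factor F (placeholder)
    "limit": 10,        # trial threshold before a scout bee restarts a source (placeholder)
    "p": 0.8,           # switch probability between global and local search (placeholder)
}
```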

The flowchart of FAD is shown in Fig. 3. DE has only two operators for updating chromosomes, mutation and crossover, whereas FAD has three phases in which mutation and crossover are used for exploitation and the Lévy flight distribution is used for exploration. These steps help FAD maintain an appropriate balance between local and global search and hence give better results than DE.

Fig. 3
figure 3

FAD algorithm

3 Experimental setup

The proposed algorithm (S_FAD) is compared with various well-known subspace clustering algorithms on actual as well as artificial datasets. The existing subspace algorithms (Müller et al. 2009a, b) compared are SCHISM (Sequeira and Zaki 2004), CLIQUE (Road and Jose 1998), MINECLUS (Yiu and Mamoulis 2003), DOC (Procopiuc 2002), INSCY (Assent et al. 2008), SUBCLU (Kailing et al. 2004), FIRES (Kriegel et al. 2005), P3C (Moise et al. 2006), PROCLUS (Aggarwal et al. 1999), and STATPC (Moise and Sander 2008). S_FAD is also compared with the SUBSCALE algorithm (Kaur and Datta 2015) and with SUBSPACE_DE (Agarwal and Mehta 2019b). The conventional subspace clustering algorithms are run on the extended WEKA toolbox provided by Müller et al. (2009a, b), which supports the execution of various subspace clustering algorithms, while S_FAD, SUBSCALE, and SUBSPACE_DE are implemented in MATLAB R2013a. The evaluation metrics, parameter settings, and datasets used for comparison are described in the following subsections.

3.1 Evaluation metrics

Performance of the proposed algorithm (S_FAD) against various subspace clustering algorithms is evaluated through classification-based (external) measures. The true cluster labels (T) for the data items in each dataset are known, and each algorithm predicts a label for each data item; the predicted and true labels form a confusion matrix from which the evaluation measures are derived. In this study, the measures used for testing the performance of S_FAD against the various subspace algorithms are rand_index and F_Score, from which average ranks and success rate ratio ranks are subsequently computed. The evaluation measures employed are briefly described below:

  • Rand index- The evaluation measure for determining the quality of the clusters formed by a clustering algorithm, defined as the ratio of correctly labeled data items to the total number of data items. The higher the rand_index, the better the algorithm: a good clustering algorithm predicts clusters that closely portray the true clusters and thus produces high-quality clusters.

  • F Score- This measure requires that a predicted cluster cover as many data items as possible from its true cluster and as few as possible from other clusters (Müller et al. 2009a, b). F_Score is expressed in Eq. (6):

    $$ F~{\text{score}} = \frac{1}{m}\mathop \sum \limits_{{i = 1}}^{m} \frac{{2*{\text{recall}}(T_{i} )*{\text{precision}}(T_{i} )}}{{{\text{recall}}(T_{i} ) + {\text{precision}}(T_{i} )}} $$
    (6)

    where ‘m’ is the number of true clusters. High precision corresponds to few items taken from other clusters, while high recall signifies high coverage of items from the true cluster. A high F_Score denotes good clustering quality.

  • Average ranks- Average ranking is a simple method for ranking algorithms, defined in Brazdil and Soares (2000). The rand_index and F_Score obtained by every algorithm on each dataset are sorted and assigned ranks, with F_Score and rand_index treated independently. The algorithm with the highest value is assigned rank 1, the second highest rank 2, and so on, for each dataset independently. The overall average rank of each algorithm is then computed as the mean of its ranks over all datasets. Let \(r_{j}^{i}\) be the rank of the jth algorithm on the ith dataset; the average rank of each algorithm over ‘n’ datasets is computed using Eq. (7) (a small numerical sketch of both ranking schemes follows this list):

    $$ r_{j} = \frac{{\mathop \sum \nolimits_{{i = 1}}^{n} r_{j}^{i} }}{n} $$
    (7)
  • Success Rate Ratio Ranks (SRR)- SRR is a ranking method in which the ratio of success rates is computed between pairs of algorithms (Brazdil and Soares 2000). It is useful for estimating the magnitude of the difference in rand_index (RI) obtained by the algorithms and for identifying significant differences; if the difference is not significant, the success rate ratio is close to 1. SRR ranking starts by taking one algorithm and one dataset at a time and calculating the ratio of its rand_index to that of each remaining algorithm, as in Eq. (8):

    $$ {\text{SRR}}_{{j,k~,j \ne k}}^{i} = \frac{{{\text{RI}}_{j}^{i} }}{{{\text{RI}}_{k}^{i} }} $$
    (8)

    where ‘i’ is the dataset, ‘j’ is the algorithm whose success rate is calculated, and ‘k’ is a compared algorithm different from ‘j’. In this way, the success rate ratio of algorithm ‘j’ with respect to algorithm ‘k’ on the ith dataset is computed. SRR is computed similarly on every dataset for the same pair of algorithms, and the values are then averaged over all datasets to obtain the overall SRR for the pair using Eq. (9):

    $$ {\text{SRR}}_{{j,k~,j \ne k}} = \frac{{\mathop \sum \nolimits_{{i = 1}}^{n} {\text{SRR}}_{{j,k~,j \ne k}}^{i} }}{n} $$
    (9)

    where ‘n’ is the number of datasets. In this way, the success rate of algorithm ‘j’ over algorithm ‘k’ is obtained, and it is computed likewise over each of the remaining algorithms. After these pairwise success rate ratios have been obtained, the mean success rate ratio of algorithm ‘j’ is calculated using Eq. (10):

    $$ {\text{SRR}}_{j} = \frac{{\mathop \sum \nolimits_{k} {\text{SRR}}_{{j,k~,j \ne k}} }}{{m - 1}} $$
    (10)

    where ‘m’ is the total number of compared subspace algorithms. In this way, the SRR of every algorithm over all datasets is computed, and the algorithms are ranked in descending order of SRR, since a higher rand_index or F_Score indicates a better algorithm.
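As a small numerical illustration of Eqs. (7)–(10), the snippet below computes average ranks and SRR values from a matrix of scores (rand_index or F_Score); the numbers are made up purely for illustration.

```python
import numpy as np
from scipy.stats import rankdata

# scores[i, j] = rand_index (or F_Score) of algorithm j on dataset i (illustrative values)
scores = np.array([[0.92, 0.88, 0.75],
                   [0.81, 0.85, 0.70],
                   [0.95, 0.90, 0.60]])

# Eq. (7): rank 1 for the highest score on each dataset, then average over datasets
avg_rank = rankdata(-scores, axis=1).mean(axis=0)

# Eqs. (8)-(10): pairwise ratios averaged over datasets, then over the other algorithms
n_data, n_alg = scores.shape
srr = np.array([
    np.mean([np.mean(scores[:, j] / scores[:, k]) for k in range(n_alg) if k != j])
    for j in range(n_alg)
])
print(avg_rank)   # lower average rank is better
print(srr)        # higher SRR is better
```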

3.2 Parameter tuning

Parameter values used in the S_FAD algorithm are shown in Table 1. These values were decided after repeating many experiments, and the best values were chosen on the basis of the algorithm’s performance.

Table 1 S_FAD parameter settings

Two further parameters used in the FAD algorithm are the population size and the number of iterations. They are tuned by repeating a set of experiments on the datasets, fixing one of the two and varying the other. Tables 2, 3, 4 and 5 report the results used to select the best population size and iteration count for the subsequent S_FAD experiments. The datasets used for parameter tuning are the artificial dataset ‘D50’ and the actual dataset ‘shape’, with 50 and 17 dimensions respectively.

Table 2 Varying population size with 10 iterations (for artificial dataset)
Table 3 Varying iterations with 40 population size (for artificial dataset)
Table 4 Varying population size with 10 iterations (for actual dataset)
Table 5 Varying iterations with 40 population size (for actual dataset)

Tables 2 and 3 correspond to parameter tuning on the artificial dataset. Table 2 shows that the algorithm achieves better rand_index and F_Score with a population size of 40. With the population size fixed at 40, Table 3 shows the effect of varying the number of iterations, and twenty iterations are selected for further experiments on the artificial dataset. For the actual dataset (the shape dataset), the effect of population size on S_FAD is shown in Table 4; the algorithm again performs best with a population size of 40. Adopting this population size, the variation over iterations is shown in Table 5, and twenty iterations are selected for further experiments on the actual dataset. For ease of implementation on higher-dimensional datasets (500 dimensions and above) (Bache and Lichman 2006), the population size is set to 10 with 20 iterations.

Parameter values of the other compared algorithms are obtained from their respective works: for the SUBSCALE algorithm they are taken from the original authors’ work (Kaur and Datta 2015), and for the remaining subspace clustering algorithms they are defined in Müller et al. (2009a, b).

3.3 Dataset description

To establish the efficacy of the S_FAD algorithm against various subspace clustering algorithms, it is evaluated on artificial and actual datasets, described in Tables 6 and 7. The artificial datasets used in this work have 10 to 75 dimensions and were also used in Müller et al. (2009a, b); Table 6 (also used in Agarwal and Mehta 2019b) gives their names and sizes. The real datasets are divided into two categories, small-dimensional and high-dimensional; they are standard datasets that various researchers have used to evaluate algorithm performance (Assent et al. 2008; Moise and Sander 2008; Sequeira and Zaki 2004). Table 7 lists the number of dimensions and instances of each actual dataset used in the experiments. The highest-dimensional dataset evaluated by S_FAD is DrivFace with 6400 attributes.

Table 6 Description of artificial datasets
Table 7 Description of actual datasets

4 Results and analysis

Thorough experiments are carried out to evaluate the performance of the proposed algorithm (S_FAD) on small- and high-dimensional datasets. S_FAD is given 30 independent runs on the small-dimensional datasets and the mean value is used for comparison. For the high-dimensional datasets (Madelon, Micromass, Gissette, and Drivface), 10 independent runs of S_FAD are considered owing to hardware limitations. The F_Score and rand_index of the various conventional algorithms, except SUBSCALE and SUBSPACE_DE, are obtained by re-running them on the extended WEKA framework provided by Müller et al. (2009a, b), while S_FAD, SUBSCALE, and SUBSPACE_DE are implemented in MATLAB.

To explore the experimental outcome, this section comprises two subsections: Sect. 4.1 compares the S_FAD algorithm with various conventional subspace algorithms on different datasets, and Sect. 4.2 presents the results of S_FAD on high-dimensional actual datasets.

4.1 Comparison of proposed algorithm (S_FAD) with conventional subspace clustering algorithms

Figures 4 and 5 depict the values obtained by S_FAD and the conventional subspace clustering algorithms for rand_index and F_Score respectively on the artificial datasets; the results of the conventional algorithms are adapted from Agarwal and Mehta (2019b) and Müller et al. (2009a, b). Figure 4 shows that S_FAD attains more than 90% accuracy on the majority of artificial datasets. On the small-dimensional D10 dataset, S_FAD performs better than all existing algorithms except SCHISM, behind which it lags by approximately 4%. On the D15 dataset, the percentage improvement of S_FAD over most algorithms is much larger than its deficit relative to DOC (approximately 5%) and SCHISM (approximately 4%). For D20, S_FAD lags INSCY by only 4% while outperforming all other algorithms. On the D25 dataset, S_FAD performs better than the compared algorithms except MINECLUS and SCHISM, behind which it lags by 8% and 2% respectively. On the 50- and 75-dimensional datasets, S_FAD outperforms all algorithms in terms of rand_index; SCHISM and INSCY achieve good rand_index on smaller datasets but fail to provide results for the 50- and 75-dimensional datasets. A similar trend is exhibited for F_Score, shown in Fig. 5.

Fig. 4
figure 4

Rand_index of algorithms on artificial datasets

Fig. 5
figure 5

F_Score of algorithms on artificial datasets

Figures 6 and 7 show the results of the proposed algorithm against the various conventional algorithms on the liver disorder, glass, diabetes, shape, breast, and vowel datasets in terms of rand_index and F_Score respectively. Figure 6 shows that the proposed algorithm (S_FAD) gives similar or better performance compared with the other subspace clustering algorithms. For the glass dataset, S_FAD lags CLIQUE, SUBCLU, and INSCY by 7.1% in total, while improving on the rest of the algorithms by 171.3% overall. On the diabetes dataset, S_FAD gives comparable performance, improving by 33.2% overall and lagging by 6.9%. For the liver disorder dataset, S_FAD outperforms all other subspace clustering algorithms in terms of rand_index. For the breast cancer and shape datasets, its performance is very similar to that of the other algorithms. On the vowel dataset, the total percentage improvement of S_FAD is much higher than its total percentage deficit (11% relative to CLIQUE, SCHISM, and INSCY). For the pendigits dataset, S_FAD remains close to the best-performing algorithm. These observations show that, although S_FAD does not outperform on every actual dataset, its average percentage improvement is much higher than its average percentage deficit; thus S_FAD performs above average on actual datasets. The analysis for F_Score is similar to that for rand_index.

Fig. 6
figure 6

Rand_index of algorithms on actual datasets

Fig. 7
figure 7

F_Score of algorithms on actual datasets

It can also be noticed that SUBCLU could not cope with the pendigits dataset, which has 7494 instances. Thus, in terms of F_Score and rand_index, S_FAD performs well compared with the other subspace clustering algorithms.

To further establish the results, two well-recognized ranking methods, average ranking and success rate ratio ranking, are applied to the F_Score and rand_index values (Brazdil and Soares 2000), and the statistical significance of the algorithms is assessed using the Wilcoxon signed-rank test. The remainder of this section is organized as follows: Sect. 4.1.1 gives the average ranking of the algorithms on artificial and actual datasets, Sect. 4.1.2 gives the success rate ratio ranking, Sect. 4.1.3 describes the Wilcoxon signed-rank test, and Sect. 4.1.4 illustrates the scalability of S_FAD with respect to data dimensionality.

4.1.1 Analysis on average ranking

The average ranks of S_FAD and the various existing subspace algorithms are computed independently on artificial and actual datasets based on F_Score and rand_index. Table 8 presents the average ranking of the subspace algorithms on the rand_index and F_Score values for the artificial datasets (depicted in Figs. 4 and 5). S_FAD attains the first rank on both rand_index and F_Score. SCHISM occupies the second rank in terms of rand_index, while DOC, which belongs to the cell-based category, occupies the second rank in terms of F_Score. It can be concluded that, after S_FAD, the cell-based algorithms show roughly similar performance on artificial datasets, while the clustering-oriented algorithms occupy ranks in the second half (i.e., below the fifth position).

Table 8 Average rank of algorithms on artificial datasets

Table 9 presents the average ranking of the subspace algorithms on the actual datasets with respect to rand_index and F_Score. The CLIQUE algorithm wins on rand_index but stands third from last on F_Score. S_FAD, in contrast, behaves very consistently and holds the second position for both rand_index and F_Score. MINECLUS takes the first and fifth positions with respect to F_Score and rand_index respectively. The two cell-based algorithms CLIQUE and MINECLUS thus show extreme behavior, whereas S_FAD is consistent. Apart from these two cell-based algorithms, none of the other subspace algorithms performs well on both rand_index and F_Score, so S_FAD shows considerably better results across both measures.

Table 9 Average rank of algorithms on actual datasets

4.1.2 Analysis of ranking on success rate ratios

Success rate ratio (SRR) ranks of S_FAD and the other existing subspace clustering algorithms are calculated on artificial and actual datasets. Table 10 gives the SRR rank of all subspace algorithms with respect to rand_index and F_Score on the artificial datasets. S_FAD wins among the various subspace algorithms and stands in the first position, with SUBSPACE_DE second for both rand_index and F_Score. These results indicate that density-based algorithms win over cell-based and clustering-oriented algorithms.

Table 10 SRR Rank of Algorithms on Artificial Datasets

Table 11 gives the ranking of the various subspace algorithms based on the success rate ratio on the actual datasets. S_FAD stands in the fifth and fourth positions for F_Score and rand_index respectively, whereas its average rank is 2. The reason for this difference between the average and SRR ranks is that S_FAD does not perform very well on a few low-dimensional datasets such as pendigits, shape, and vowel. Nevertheless, S_FAD gives consistent performance, as its ranks are approximately the same for rand_index and F_Score, whereas the performance of CLIQUE, MINECLUS, DOC, PROCLUS, and INSCY varies widely between the two measures. The CLIQUE algorithm holds the first position with the highest rand_index but at the same time gives a poor F_Score on actual datasets, whereas S_FAD performs comparably on both rand_index and F_Score.

Table 11 SRR Rank of Algorithms on Actual Datasets

The analysis also shows that the overall percentage improvement of S_FAD outweighs its percentage deficit relative to the other subspace clustering algorithms. Thus, it can be concluded that S_FAD presents considerably good efficacy, consistency, and reliability in terms of rand_index and F_Score for the majority of small-dimensional datasets. S_FAD is a hard subspace clustering algorithm that also performs clustering on high-dimensional datasets (shown in Sect. 4.2). The next section discusses the statistical significance of the obtained results using the Wilcoxon signed-rank test.

4.1.3 Statistical significance of results of proposed algorithm (S_FAD) versus other subspace clustering algorithms

To further strengthen the case for S_FAD over the conventional subspace clustering algorithms, a statistical hypothesis test is performed to establish the statistical significance of the experimental results. The test starts with a null hypothesis H0 and an alternative hypothesis H1, which state the following:

  • H0 = results of compared algorithms are statistically the same

  • H1 = results of compared algorithms are statistically not the same

For the problem at hand, a non-parametric test is suitable (Demšar 2006), as no assumptions are made about the population distribution. The Wilcoxon signed-rank test is used to determine whether the results obtained by the various subspace algorithms differ significantly on each dataset. The test ranks the absolute differences in rand_index and F_Score on the artificial and actual datasets and then compares the sums of ranks of the positive and negative differences. The significance level is set to 5%, i.e., the probability of rejecting the null hypothesis when it is true is 5%. In this work, the significance of the differences between the S_FAD algorithm and the other subspace algorithms is determined on the basis of rand_index and F_Score.
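Such a paired test can be carried out with SciPy as below; the paper compares the rank sums W+ and W− against tabulated critical values, while SciPy reports a p-value, but both lead to the same accept/reject decision at the 5% level. The paired scores here are placeholders, not values from the paper.

```python
from scipy.stats import wilcoxon

# placeholder paired F_Scores of S_FAD and one compared algorithm on the same datasets
s_fad    = [0.93, 0.88, 0.91, 0.85, 0.95, 0.90]
baseline = [0.80, 0.86, 0.70, 0.84, 0.75, 0.78]

stat, p_value = wilcoxon(s_fad, baseline)   # two-sided test on the paired differences
print(stat, p_value)
# H0 (the results are statistically the same) is rejected at the 5% level if p_value < 0.05
```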

4.1.3.1 Results of Wilcoxon signed-rank test on artificial datasets

Tables 12, 13, and 14 show the Wilcoxon signed-rank test of the proposed algorithm (S_FAD) versus the cell-based, density-based, and clustering-oriented subspace algorithms respectively on the artificial datasets. Table 12 shows that S_FAD differs significantly from SCHISM on F_Score. For the other comparisons with the cell-based algorithms, however, no significant difference (H0 true) is obtained on the artificial datasets, because the value of the statistic ‘W’ exceeds the critical value of the Wilcoxon signed-rank test.

Table 12 Wilcoxon signed-rank test of S_FAD vs. Cell-based
Table 13 Wilcoxon signed-rank test of S_FAD versus density-based
Table 14 Wilcoxon signed-rank test of S_FAD versus clustering oriented

Table 13 gives the outcomes of the Wilcoxon signed-rank test of S_FAD against the density-based algorithms. For the FIRES, SUBSCALE, and SUBSPACE_DE algorithms the statistic ‘W’ equals the critical value of 0, which means S_FAD outperforms them on all artificial datasets; hence it shows a significant difference with respect to these algorithms. There is no significant difference between S_FAD and the INSCY algorithm with respect to rand_index and F_Score, although S_FAD shows better results on the majority of datasets. Thus, it can be concluded that S_FAD shows overall better performance than the existing density-based algorithms. Table 14 presents the statistical test of S_FAD against the clustering-oriented subspace algorithms. S_FAD surpasses the clustering-oriented algorithms, i.e., P3C, PROCLUS, and STATPC, in terms of rand_index and F_Score; moreover, W− = 0, meaning that S_FAD outperforms them on all datasets and yields higher rand_index and F_Score than the clustering-oriented algorithms.

4.1.3.2 Results of Wilcoxon signed-rank test on actual datasets

Table 15 shows the Wilcoxon signed-rank test of S_FAD versus the cell-based algorithms on the actual datasets. Since seven actual datasets are considered and a two-tailed test is employed, the critical value (from the Wilcoxon signed-rank table) is 2. S_FAD shows a statistically significant difference from the CLIQUE algorithm based on F_Score; since W− = 0, S_FAD performs better on all datasets. For the rest of the cell-based algorithms, S_FAD shows no significant difference (H0 true).

Table 15 Wilcoxon signed-rank test of S_FAD versus cell-based

Table 16 gives the Wilcoxon signed-rank test of S_FAD against the density-based subspace algorithms on the actual datasets. S_FAD performs significantly better with respect to F_Score and rand_index than the FIRES, SUBCLU, SUBSCALE, and SUBSPACE_DE algorithms, while no statistically significant difference is found against INSCY. For the SUBCLU, SUBSCALE, and SUBSPACE_DE algorithms, W− = 0, which means S_FAD is superior on all datasets.

Table 16 Wilcoxon signed-rank test of S_FAD versus density-based

Table 17 presents the statistical comparison of S_FAD with the clustering-oriented subspace algorithms. S_FAD gives a statistically better rand_index than the P3C algorithm. However, the value of ‘W’ exceeds the critical value for PROCLUS and STATPC, so there is no significant difference in performance with respect to rand_index and F_Score. It is noticeable that W+ > W− even where no significant difference exists, which means S_FAD is still the better performer against the clustering-oriented algorithms. From the above discussion, it can be established that S_FAD performs considerably better than the majority of algorithms on most of the datasets.

Table 17 Wilcoxon signed-rank test of S_FAD versus clustering oriented

4.1.4 Algorithm’s scalability analysis

Scalability here refers to the performance of the algorithms as the dimensionality of the data increases. Scalability can be shown only on the artificial datasets, as these have the same number of instances but varying numbers of attributes; it is presented with respect to both rand_index and F_Score, plotted on the y-axis against data dimensionality on the x-axis. The scalability of S_FAD versus the cell-based algorithms is shown in Figs. 8 and 9. The proposed algorithm shows comparable performance on the 10- to 25-dimensional datasets and better performance on the 50- and 75-dimensional datasets, whereas DOC and MINECLUS show erratic behavior as dimensionality increases. Figures 10 and 11 show the scalability of S_FAD against the density-based algorithms: INSCY performs very well on the 25-dimensional dataset but cannot handle higher-dimensional datasets, and S_FAD takes the lead on the 50- and 75-dimensional datasets. The scalability of S_FAD versus the clustering-oriented algorithms is depicted in Figs. 12 and 13; S_FAD gives better results on almost all datasets than clustering-oriented algorithms such as STATPC, P3C, and PROCLUS.

Fig. 8
figure 8

Scalability of S_FAD versus Cell based on Rand_index

Fig. 9
figure 9

Scalability of S_FAD versus Cell based on F_Score

Fig. 10
figure 10

Scalability of S_FAD versus density based on Rand_index

Fig. 11
figure 11

Scalability of S_FAD versus density based on F_Score

Fig. 12
figure 12

Scalability of S_FAD versus clustering oriented on RI

Fig. 13
figure 13

Scalability of S_FAD versus clustering oriented on F_Score

Thus, S_FAD scales well with data dimensionality compared with the various subspace clustering algorithms. The next section illustrates the performance of S_FAD on very high-dimensional datasets.

4.2 S_FAD on high-dimensional actual dataset

To validate the efficacy of the proposed algorithm (S_FAD) on high-dimensional data, it is run on actual datasets with a large number of attributes, and it is found to form subspace clusters successfully on them. Tables 18 and 19 present the best (maximum), worst (minimum), mean (average), median, and standard deviation of the F_Score and rand_index over 10 independent runs. The other subspace clustering algorithms could not cope with these high-dimensional datasets, and results of other clustering algorithms are not available for them; the SUBSCALE algorithm (Kaur and Datta 2015) was attempted on the MADELON dataset, but exact result values are not revealed in that work. The actual datasets included in this study are MADELON, MICROMASS, GISSETTE, and DRIV FACE with 500, 1300, 5000, and 6400 dimensions respectively.

Table 18 F_Score of S_FAD on high-dimensional actual datasets
Table 19 Rand_index of S_FAD on high-dimensional actual datasets

Tables 18 and 19 show that the proposed algorithm (S_FAD) achieves a high rand_index on the MADELON, MICROMASS, and DRIV FACE datasets. In the case of the GISSETTE dataset, the algorithm’s efficacy is lower because the dataset is sparse. Since the standard deviation is very low, the algorithm does not require many independent runs to obtain good results.

4.3 Discussion

S_FAD is assessed against 11 subspace clustering algorithms on a total of 13 actual and artificial datasets with respect to rand_index and F_Score. On small dimensions, the total percentage improvement of S_FAD is higher than its total percentage lag compared with the other algorithms. The subspace algorithms are also ranked by average ranking and SRR ranking on artificial and actual datasets independently, and the Wilcoxon signed-rank test is performed to validate the significance of the differences in the results. From the experimental results and analysis, it is inferred that S_FAD performs better in terms of F_Score and rand_index than most of the existing subspace algorithms on the majority of datasets. Additionally, S_FAD is executed on high-dimensional actual datasets, and the results show that it scales very well on such high-dimensional, thin datasets; as dataset size increases, measures such as sampling would need to be incorporated to maintain efficacy. The time complexity of the S_FAD algorithm is \(O(2^{m} \cdot n\log n)\), where m is the number of dimensions and n is the number of records in the dataset, so there is a trade-off between rand_index and time complexity. Although the time complexity of S_FAD is relatively high, it provides near-optimal solutions on high-dimensional problems, whereas traditional subspace clustering algorithms fail to provide results for high-dimensional data. S_FAD also finds overlapping subspace clusters with no redundant information, a property hardly satisfied by any other subspace clustering algorithm, and it can additionally determine subspace clusters of varying densities.

Hence, conventional subspace clustering algorithms are better suited to applications with small dimensionality and time-sensitive requirements, whereas for high-dimensional applications S_FAD is the better choice.

5 Conclusion

Subspace clustering of data with a large number of attributes is a computational challenge in the data mining field: the number of subspaces and the dimensions involved in each subspace are unknown before clustering, and the number of possible subspaces grows exponentially with dimensionality. To resolve these issues, a novel subspace clustering algorithm, S_FAD, is proposed and evaluated in terms of F_Score and rand_index. Using these measures, the proposed algorithm is compared with various conventional subspace clustering algorithms on the basis of four criteria: average ranking, success rate ratio ranking, the Wilcoxon signed-rank test, and scalability with dimensionality. S_FAD provides considerably good performance under all of these criteria. It gives no redundant subspace information, as it performs clustering in maximal subspaces, it takes input data in their original form without normalization, and it determines overlapping subspace clusters of varying densities. Thus, S_FAD offers several advantages over existing subspace clustering algorithms and successfully determines clusters in a 6400-dimensional actual dataset. In future work, the S_FAD algorithm can be applied to distributed database applications such as vertical and horizontal fragmentation, and further techniques can be incorporated to cluster large datasets and improve clustering quality.