1 Introduction

High-dimensional data are data with a large number of features. Such data arise in many domains, including recommendation systems, microarray analysis, and social media. Their rapid growth has attracted the attention of researchers for the last two decades (Steinbach et al. 2003; Kriegel et al. 2009). Clustering is the task of finding groups of similar data objects based on their attributes. Effective clustering of high-dimensional datasets is an important and challenging research issue in data mining (Abualigah et al. 2020; Abualigah 2019). Traditional clustering algorithms such as K-Means, DBSCAN, and OPTICS (Fahad et al. 2014) perform clustering in the full-dimensional space: they attempt to find clusters using all attributes of every data object. However, applying these algorithms becomes computationally expensive when the number of attributes/dimensions is large, a problem known as the “curse of dimensionality” (Steinbach et al. 2004). One reason is that distance measures lose their discriminative power because data points are sparse in high-dimensional space. Clusters in such spaces typically lie in a few relevant dimensions; such subsets of features are called subspaces. Irrelevant dimensions and noise mask the true clusters, so classical clustering algorithms fail to determine clusters lying in different subspaces. Subspace clustering is one of the efficient ways of clustering high-dimensional data.

The data mining research community has proposed a number of techniques for clustering high-dimensional data (Assent 2012; Abualigah et al. 2021a, b). To determine clusters lying in different subsets of dimensions, subspace clustering algorithms (Domeniconi et al. 2004; Parsons et al. 2004) are employed. Subspace clustering finds groups of similar objects within relevant subsets of the dataset’s dimensions. Subspace clustering algorithms are broadly classified into two categories: hard subspace and soft subspace clustering (Deng et al. 2016). Hard subspace clustering determines exact subspaces for the clusters; these subspaces may overlap across clusters. Müller et al. (2009a, b) divided hard subspace clustering algorithms into three categories: cell-based, density-based, and clustering-oriented algorithms. Researchers have developed a number of algorithms to obtain better subspace clusters; relevant work is discussed below.

Evolutionary algorithms have been incorporated to form effective subspace clusters (Agarwal and Mehta 2014, 2017; Abualigah et al. 2021a, b). The first study in this direction was by Sarafis et al. (2003), who introduced genetic operators for determining subspaces in a subspace clustering algorithm; experiments were performed on 80-dimensional datasets but were not compared with existing algorithms. Lu et al. (2011) introduced a technique for soft projected clustering of high-dimensional data using particle swarm optimization (PSO); it addresses the problem of variable weighting in projected clustering and is evaluated on datasets of up to 2000 dimensions. Timmerman et al. (2013) presented a new variant of the subspace k-means algorithm for shaping clusters in all dimensions; the algorithm was assessed against k-means, factorial k-means, mixture factor analysis, and reduced k-means using adjusted rand_index and cluster variance on a 9-dimensional dataset only. Lin et al. (2014) suggested an evolutionary approach for determining subspaces; to improve clustering quality, it uses a hybrid genetic algorithm in which the local search is performed by PSO, and its efficacy is assessed in terms of error rate on problems of up to 13 dimensions. Kaur and Datta (2015) presented an extended version of the SUBSCALE algorithm, examined on F_Score for datasets of up to 6144 dimensions, although no result for the 6144-dimensional dataset is reported in the paper. Kumar et al. (2016) proposed a clustering algorithm (clusiVat) to address the issues of big data; it is compared with four existing algorithms on the basis of rand_index and running time for 500-dimensional data only. A survey of nature-inspired algorithms with evolutionary strategies and applications is given in Agarwal and Mehta (2014), a comprehensive analysis of nature-inspired algorithms in Agarwal and Mehta (2017), and a comparative analysis of these algorithms on clustering in Agarwal and Mehta (2015). To improve clustering quality, an enhanced version of the flower pollination algorithm has also been employed (Agarwal and Mehta 2016). Zhong and Pun (2020) gave a subspace algorithm that finds similarities between data points and performs feature selection; it normalizes the column values, does not find overlapping clusters, and was evaluated on at most 6000 dimensions. Yan et al. (2020) proposed a multi-view subspace clustering algorithm that extends K-means; the clusters formed are mostly spherical and include all samples, so maximal subspaces, overlapping clusters, and noisy points cannot be identified. An extended version of high-dimensional data clustering using the multivariate t-distribution is given by Pesevski et al. (2018); the algorithm is evaluated only on low-dimensional datasets (at most 8 dimensions). A novel subspace clustering technique for a large number of samples is given by Liu et al. (2020); although it discovers clusters with low time complexity, overlapping clusters cannot be found. Another low-rank subspace clustering algorithm is given by Zhao et al. (2019); the maximum number of dimensions evaluated is only in the hundreds, and the developed algorithms are not compared with conventional subspace clustering algorithms.

Although various subspace clustering algorithms have been developed, many challenges remain in finding subspace clusters in high-dimensional data. There is scope for improvement in clustering quality and in finding overlapping subspace clusters. Existing algorithms are often unable to find maximal subspaces, i.e., subspaces without redundant information, and to exclude noisy data points from clusters. Existing hard subspace clustering algorithms cannot deal with truly high-dimensional data, i.e., data with thousands of attributes; most existing studies on high-dimensional clustering have been assessed on a few hundred attributes only. Additionally, these algorithms require the data to be normalized to the range 0–1 before clustering; such normalization is usually of the min–max type and does not handle outliers properly. Existing algorithms also have limited capability to find clusters of varying densities: clusters found in subsets of dimensions are generally of the same density, which may cause redundant data points to be included in a cluster or relevant points to be left out.

These limitations motivate the development of a new subspace clustering algorithm. This paper is an extended version of the work presented in Agarwal and Mehta (2019b). In that work, the subspace algorithm was integrated with a differential evolution algorithm; however, it tends to get stuck in local optima because of its limited capability to explore the complete search space, and it fails to maintain a judicious balance between exploration and exploitation. One reason is that the parameters of DBSCAN are not well tuned. These shortcomings are addressed in the present work. A thorough analysis of the results and statistical tests substantiate the performance of S_FAD with respect to subspace_DE (subspace with differential evolution) (Agarwal and Mehta 2019b) and other subspace clustering variants.

In this work, a hybrid meta-heuristic subspace clustering algorithm named S_FAD is proposed. In S_FAD, a self-tuned DBSCAN algorithm is used to perform clustering. Clustering begins on one-dimensional data; once clusters in each dimension are formed, their details are stored in a hash table. S_FAD uses hashing to find maximal subspace clusters, and the DBSCAN algorithm is then executed in each maximal subspace to form the final clusters. It employs a bottom-up subspace search method to determine subspaces. A self-tuned version of the DBSCAN algorithm is introduced in this work, in which the input parameters of DBSCAN are determined automatically for the dataset at hand. This is achieved by a hybrid meta-heuristic algorithm named FAD (Agarwal and Mehta 2019a), a combination of the flower pollination, artificial bee colony, and differential evolution algorithms (named ABC_DE_FP in Agarwal and Mehta 2019a). Experimentally, it is observed that S_FAD can find clusters in datasets of up to 6400 dimensions. It overcomes the shortcomings of the bottom-up approach by automatically determining optimized parameters suitable for a given dataset. It successfully determines overlapping subspace clusters and does not need any parameters a priori, such as the number of clusters or the number of subspaces. S_FAD eliminates duplicate subspaces by determining the maximal subspaces through hashing, and it obtains all possible clusters in which each data point participates. In addition, S_FAD does not normalize the original dataset, as most subspace clustering algorithms do, and it finds arbitrarily shaped clusters of varying densities.

The performance of S_FAD is evaluated on standard artificial and actual datasets and compared with various conventional subspace clustering algorithms. F_Score and rand_index are used as evaluation measures, and on their basis S_FAD is judged using (a) average ranking, (b) success rate ratio ranking, (c) the Wilcoxon signed-rank test, and (d) scalability in terms of dimensions. The results demonstrate that the proposed algorithm handles the challenges of subspace and high-dimensional clustering effectively.

The rest of the paper is organized as follows. Section 2 describes the proposed algorithm in detail along with its pseudocode. The experimental setup is presented in Sect. 3. Section 4 reports the experimental results and analysis, and Sect. 5 concludes the paper.

2 Subspace clustering in high-dimensional data

Clusters in high-dimensional data are mostly present in low-dimensional subspaces, and these subspaces may vary from cluster to cluster. Hence subspace clustering plays an important role in clustering high-dimensional data. Subspace clustering algorithms explore the subspaces existing in a dataset with two different search techniques (Parsons et al. 2004): top-down and bottom-up subspace search. The top-down approach is an iterative method that starts by finding clusters in the full-dimensional space, assuming all dimensions have equal weights. Thereafter, according to the clusters formed, each dimension is allotted a certain weight, and the modified weights are used in the next iteration to generate new clusters. The most crucial input parameter is the size of the subspaces, which is difficult to determine at an early stage, and this approach has the drawback of finding disjoint subspaces of equal size. Some top-down search algorithms (Parsons et al. 2004) are PROCLUS, FINDIT, and ORCLUS. The bottom-up search method starts clustering from single dimensions and follows the Apriori principle to reduce the search space: only those dimensions containing dense units participate in the formation of higher-dimensional subspaces. This approach finds overlapping clusters and subspace clusters of arbitrary shape; its input parameters are the density threshold and the grid size. CLIQUE, DOC, etc., are bottom-up subspace algorithms.
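To illustrate the Apriori principle behind the bottom-up search, the following Python sketch (a generic illustration, not the exact procedure of any cited algorithm) generates (k + 1)-dimensional candidate subspaces from the k-dimensional subspaces found to contain dense units, pruning any candidate whose k-dimensional subsets are not all dense.

```python
from itertools import combinations

def generate_candidates(dense_subspaces):
    """Apriori-style join: a (k+1)-dimensional subspace is a candidate only if
    every one of its k-dimensional sub-subspaces contains dense units."""
    dense = set(map(frozenset, dense_subspaces))
    k = len(next(iter(dense)))
    candidates = set()
    for a in dense:
        for b in dense:
            union = a | b
            if len(union) == k + 1:
                # prune: all k-dimensional subsets of the union must be dense
                if all(frozenset(s) in dense for s in combinations(union, k)):
                    candidates.add(union)
    return candidates

# dense 1-D subspaces {0}, {1}, {2} yield the 2-D candidates {0,1}, {0,2}, {1,2}
print(generate_candidates([{0}, {1}, {2}]))
```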

2.1 Proposed algorithm (S_FAD)

A novel density-based meta-heuristic subspace clustering algorithm is proposed for clustering high-dimensional data. It is named S_FAD (subspace FAD) for ease of reference and comparison. S_FAD is inspired by Kaur and Datta (2015) and uses the bottom-up strategy of subspace clustering (Parsons et al. 2004), which is based on the Apriori principle (Agrawal et al. 1996). Accordingly, data points that do not participate in lower-dimensional clusters are eliminated from higher-dimensional subspace clustering.

As shown in Fig. 1, the algorithm begins by performing clustering on each attribute of the dataset (taking all tuples) using the self-tuned DBSCAN algorithm (explained in the next subsection). Prior to clustering, every data point is assigned a large unique random integer called its signature. In S_FAD, the signature is a randomly generated 15-digit natural number for each record/tuple in the dataset; large numbers are used to avoid identical cluster signatures. The signature of a cluster is the sum of the signatures of the data points belonging to it and acts as the key of the hash table, in which the clustered (dense) data points, their signature sum, and the dimension are recorded. If two entries in the hash table have the same signature, their attributes are combined into a single entry, as shown in Fig. 2 (signatures are used simply to match identical clusters formed in different attributes, so signatures are compared instead of individual data points). This step ensures that the merged attributes belong to the same subspace. Thereafter, rows of the hash table with the same subspaces/dimensions are merged, which yields the relevant maximal subspaces together with their dense points. Finally, the self-tuned DBSCAN algorithm is run within each maximal subspace, producing the final clusters of the high-dimensional dataset.
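The signature-and-hashing step can be sketched as follows (a simplified Python illustration; the helper names and the dictionary layout are ours, and signature collisions are ignored). Each one-dimensional cluster is keyed by the sum of its points' signatures, so clusters containing exactly the same points in different dimensions hash to the same entry, and their dimensions accumulate into one candidate subspace.

```python
import random
from collections import defaultdict

def random_signatures(n_points):
    # one large (15-digit) random integer per record, as described above
    return [random.randint(10**14, 10**15 - 1) for _ in range(n_points)]

def build_hash_table(clusters_per_dim, signatures):
    """clusters_per_dim maps a dimension to the list of point-index sets that
    DBSCAN found in that dimension.  Clusters with the same signature sum
    contain the same points, so their dimensions form one candidate subspace."""
    table = defaultdict(lambda: {"dims": set(), "points": set()})
    for dim, clusters in clusters_per_dim.items():
        for points in clusters:
            key = sum(signatures[p] for p in points)   # cluster signature = hash key
            table[key]["dims"].add(dim)
            table[key]["points"].update(points)
    return table
```

Merging the resulting rows that share the same dimension set then gives the maximal subspaces on which DBSCAN is re-run.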

Fig. 1
figure 1

Clustering in single-dimensional data

Fig. 2
figure 2

S_FAD algorithm

Pseudocode of S_FAD is presented in Algorithm 1. Assume a given dataset X with m tuples and n attributes {d1, d2, d3, …, dn}.

figure a

2.2 Self-tuned DBSCAN using FAD

In the proposed algorithm, the DBSCAN (density-based spatial clustering of applications with noise) algorithm is used to perform clustering in each dimension of the data as well as in the maximal subspaces formed via the hash table. The advantage of DBSCAN over partition-based algorithms (Fahad et al. 2014) is its ability to find clusters of arbitrary shape and to detect noisy points; moreover, it clusters a dataset without any prior knowledge of the number of clusters. DBSCAN takes two input parameters, epsilon (ε) and MinPts (τ): ε is the maximum distance between two points for them to be considered neighbors, and MinPts is the minimum number of points required within the ε-neighborhood of a point to form a cluster. The efficacy of DBSCAN largely depends on these two parameters; they are sensitive, vary from dataset to dataset, and are hard to determine a priori. Therefore, for each dataset, optimized values of these parameters are determined by the proposed meta-heuristic FAD algorithm, and DBSCAN is accordingly referred to here as self-tuned DBSCAN.
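To illustrate how the tuned parameters are consumed, the snippet below runs scikit-learn's DBSCAN with a given (ε, MinPts) pair; in S_FAD these two values would be supplied by the FAD optimizer rather than fixed by hand, and the concrete numbers here are placeholders.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(200, 1)            # e.g. a single attribute of the dataset
eps, min_pts = 0.05, 4                 # placeholders; FAD would supply these values
labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
# label -1 marks noise points; every other label is a cluster id
print(np.unique(labels))
```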

FAD is a swarm intelligence-based algorithm that amalgamates flower pollination (FP) (Yang 2012a, b), the artificial bee colony algorithm (ABC) (Karaboga and Basturk 2007), and differential evolution (DE) (Storn and Price 1997). FAD is the name given to the binary version of the ABC_DE_FP algorithm (Agarwal and Mehta 2019a), which was developed for continuous optimization problems. It was established that ABC_DE_FP performs better than existing meta-heuristic algorithms on complex benchmark functions (Agarwal and Mehta 2019a). Hence, a binary version of the algorithm is developed here to optimize the DBSCAN parameter values, i.e., MinPts and ε.

The FAD algorithm encodes each individual of the population in binary form. Each individual represents MinPts as a bit string (Karami and Johansson 2014); since MinPts is a discrete value, the binary version is applied instead of the continuous one. The number of dimensions ‘D’ equals the number of bits required to represent MinPts. In FAD, the population of individuals is randomly initialized, and each individual represents a food source whose fitness is computed through an objective function. Here, purity (Eq. (1)) is the objective function used to obtain the best values of MinPts and ε. Purity is an external validation criterion for measuring the quality of the clusters formed: the higher the purity, the better the MinPts and ε. To compute purity, the most frequent class in a cluster is assigned to that cluster; the correctness of these assignments is then measured by counting the number of appropriately assigned data points and dividing by the total number of points in the dataset, N.

$$ {\text{purity}} = \frac{{\mathop \sum \nolimits_{{j = 1}}^{k} \max _{{1 \le i \le l}} \left| {{\text{class}}_{i} \cap {\text{cluster}}_{j} } \right|}}{N} $$
(1)

where k is the number of clusters formed, l is the number of predefined classes in the dataset, i indexes the classes, and j indexes the clusters produced by the clustering algorithm. The numerator of Eq. (1) counts, for each cluster j, the data points of its majority class i, so that class i is assigned to cluster j.
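A direct implementation of the purity objective of Eq. (1) might look as follows (a sketch; it assumes integer class labels starting at 0 and skips DBSCAN's noise label −1, a choice the text above does not prescribe).

```python
import numpy as np

def purity(true_classes, cluster_labels):
    """Eq. (1): for every cluster, count the points of its majority class,
    sum these counts over clusters and divide by the total number of points N."""
    true_classes = np.asarray(true_classes)
    cluster_labels = np.asarray(cluster_labels)
    correct = 0
    for c in np.unique(cluster_labels):
        if c == -1:                      # assumption: DBSCAN noise points are skipped
            continue
        members = true_classes[cluster_labels == c]
        correct += np.bincount(members).max()
    return correct / len(true_classes)

# toy example: two clusters, one of which mixes two classes
print(purity([0, 0, 1, 1, 1], [0, 0, 1, 1, 0]))   # -> 0.8
```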

The working of the FAD algorithm is divided into three phases: employed bee, onlooker bee, and scout bee. In the employed bee phase, each food source is updated with the mutation (Eq. 2) and crossover (Eq. 3) strategies of the differential evolution algorithm:

$$ v_{i} = ~x_{a} + F \cdot \left( {x_{b} - x_{c} } \right) $$
(2)

where \(v_{i}\) is the mutant food source, i = 1, 2, …, N, and xa, xb, and xc are distinct target food sources of the same population whose indices a, b, and c are all different from i. F is a scaling factor; in the present algorithm it is drawn uniformly at random from the range [betamin, betamax]. Crossover is then performed between the mutant and the target food source; the newly generated food source is called the trial food source \(u_{{ij}}\).

$$ u_{{ij}} = \left\{ {\begin{array}{*{20}l} {v_{{ij}} ,} \hfill & {{\text{if}}\;{\text{rand}}\left[ {0,1} \right] \le {\text{Cr}}\;{\text{or}}\;j = j_{{{\text{rand}}}} } \hfill \\ {x_{{ij}} ,} \hfill & {{\text{otherwise}}} \hfill \\ \end{array} } \right. $$
(3)

where j indexes the D dimensions and jrand is an index chosen randomly from 1 to D; this ensures that the trial vector uij inherits at least one component from the mutant vector vij. Cr is the crossover rate, which controls whether a component is taken from the target or the mutant vector. The fitness of the trial solution is computed in the form of purity: if the purity (fitness) of the new solution is better, it replaces the old solution in the population; otherwise its trial value is incremented by 1. Trial is a counter maintained for each food source: it is reset to 0 whenever the food source is updated and incremented otherwise. After the employed bee phase, food sources are updated in the onlooker bee phase, in which the choice between global and local search is controlled by the switch probability p. Global search is performed through the global pollination process of the flower pollination algorithm (Eq. 4):

$$ x_{i}^{{t + 1}} = x_{i}^{t} + L\left( {x_{i}^{t} ~ - ~g_{*} } \right) $$
(4)

where \(x_{i}^{t}\) is the ith food source at the tth generation, \(g_{*}\) is the current best food source, and L is a step size drawn from a Lévy flight distribution (Pavlyukevich 2007; Yang 2012a, b). If the algorithm switches to the local search process instead, the DE mutation strategy of Eq. (2) followed by the crossover of Eq. (3) is applied to obtain the new food source. If the fitness of the food source improves, it is updated in the population; otherwise its trial counter is incremented. Food sources with the highest fitness value (nectar amount) are memorized. If the trial counter of any food source exceeds the Limit value, that food source is discarded and a scout bee searches for a new one. The process continues until the termination condition, a maximum number of iterations, is reached.
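The sketch below puts Eqs. (2)–(4) together for one onlooker-bee update: a Lévy-flight move toward the current best for global search (Eq. 4, using Mantegna's algorithm, a common way to draw Lévy steps), and DE mutation plus binomial crossover for local search (Eqs. 2–3). Because FAD works on bit strings, the real-valued vectors are mapped back to bits with a sigmoid transfer function; that binarization is our assumption and the original FAD may use a different scheme.

```python
import numpy as np
from scipy.special import gamma

rng = np.random.default_rng(0)

def levy_step(D, beta=1.5):
    # Mantegna's algorithm for Levy-distributed step sizes
    sigma = (gamma(1 + beta) * np.sin(np.pi * beta / 2) /
             (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma, D)
    v = rng.normal(0.0, 1.0, D)
    return u / np.abs(v) ** (1 / beta)

def binarize(v):
    # sigmoid transfer: map a real-valued vector back to a bit string (assumption)
    return (rng.random(v.shape) < 1.0 / (1.0 + np.exp(-v))).astype(int)

def decode_minpts(bits):
    # interpret the bit string as MinPts (at least 1)
    return max(1, int("".join(map(str, bits)), 2))

def onlooker_update(pop, i, g_best, p=0.8, F=0.8, Cr=0.9):
    """One onlooker-bee update of bit-string individual i."""
    N, D = pop.shape
    if rng.random() < p:                                    # global pollination, Eq. (4)
        return binarize(pop[i] + levy_step(D) * (pop[i] - g_best))
    # local search: DE mutation (Eq. 2) followed by binomial crossover (Eq. 3)
    a, b, c = rng.choice([k for k in range(N) if k != i], 3, replace=False)
    mutant = binarize(pop[a] + F * (pop[b] - pop[c]))
    j_rand = rng.integers(D)
    trial = pop[i].copy()
    for j in range(D):
        if rng.random() <= Cr or j == j_rand:
            trial[j] = mutant[j]
    return trial
```

The returned bit string is then evaluated by decoding it into MinPts, running the self-tuned DBSCAN, and computing the purity of Eq. (1); the trial replaces its parent only if the purity improves, otherwise the parent's trial counter grows toward the Limit.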

To calculate the fitness of a food source, ε is derived from the corresponding MinPts value. The epsilon value (ε) is determined analytically (Daszykowski et al. 2001) from MinPts and the data matrix x using Eq. (5):

$$ {\text{Eps}} = \left( {\frac{{\left( {\mathop \prod \nolimits_{{i = 1}}^{n} \left( {\max \left( {x_{i} } \right) - \min \left( {x_{i} } \right)} \right)} \right)*k*\Gamma \left( {0.5n + 1} \right)}}{{m\sqrt {\pi ^{n} } }}} \right)^{{1/n}} $$
(5)

where m is the number of tuples, n is the number of dimensions of the data matrix x, max(xi) and min(xi) are the maximum and minimum values along the ith dimension, k is MinPts, and \(\Gamma\) is the gamma function, which generalizes the factorial of its argument.
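A direct transcription of Eq. (5) might look as follows (a sketch; it assumes no constant attribute, i.e., every dimension has a positive range, and evaluates the expression in log space because the product of ranges and π^n overflow easily for large n).

```python
import numpy as np
from scipy.special import gammaln

def analytical_eps(X, minpts):
    """Eq. (5): derive epsilon from MinPts (k) and the m x n data matrix X."""
    m, n = X.shape
    ranges = X.max(axis=0) - X.min(axis=0)       # max(x_i) - min(x_i) per dimension
    log_num = np.sum(np.log(ranges)) + np.log(minpts) + gammaln(0.5 * n + 1)
    log_den = np.log(m) + 0.5 * n * np.log(np.pi)
    return float(np.exp((log_num - log_den) / n))

# usage: eps for MinPts = 4 on a random 500 x 20 data matrix
print(analytical_eps(np.random.rand(500, 20), 4))
```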

The FAD algorithm initializes the following input parameters (a possible configuration is sketched after the list):

  • Iterations: the number of times each individual is updated by the algorithm.

  • D: the number of bits used to represent MinPts.

  • N: the population size.

  • betamin, betamax: the range from which the scaling factor F is drawn (see Eq. 2).

  • Trial: a counter incremented when a solution does not improve.

  • Limit: the threshold for invoking scout bees (random re-initialization of an individual).

  • p: the switch probability for selecting between local and global search.
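For concreteness, such a parameter set could be collected as in the dictionary below. The population size and iteration count reflect the tuning reported in Sect. 3.2; the remaining numeric values are illustrative placeholders, not the values of Table 1.

```python
# illustrative FAD configuration (not the tuned values of Table 1)
fad_params = {
    "iterations": 20,   # tuned in Sect. 3.2
    "N": 40,            # population size (10 is used for 500+ dimensional data)
    "D": 8,             # bits used to encode MinPts (placeholder)
    "betamin": 0.2,     # lower bound of the scaling factor F (placeholder)
    "betamax": 0.8,     # upper bound of the scaling factor F (placeholder)
    "limit": 10,        # trial threshold before a scout bee restarts a source (placeholder)
    "p": 0.8,           # switch probability between global and local search (placeholder)
}
```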

The flowchart of FAD is shown in Fig. 3. DE has only two operators for updating chromosomes, mutation and crossover, whereas FAD has three phases in which mutation and crossover are used for exploitation and the Lévy flight distribution is used for exploration. These steps help FAD maintain an appropriate balance between local and global search and hence give better results than DE.

Fig. 3
figure 3

FAD algorithm

3 Experimental setup

The proposed algorithm (S_FAD) is compared with various well-known subspace clustering algorithms on actual as well as artificial datasets. The existing subspace algorithms (Müller et al. 2009a, b) compared are SCHISM (Sequeira and Zaki 2004), CLIQUE (Road and Jose 1998), MINECLUS (Yiu and Mamoulis 2003), DOC (Procopiuc 2002), INSCY (Assent et al. 2008), SUBCLU (Kailing et al. 2004), FIRES (Kriegel et al. 2005), P3C (Moise et al. 2006), PROCLUS (Aggarwal et al. 1999), and STATPC (Moise and Sander 2008). S_FAD is also compared with the SUBSCALE algorithm (Kaur and Datta 2015) and with SUBSPACE_DE (Agarwal and Mehta 2019b). The conventional subspace clustering algorithms are run on the extended WEKA toolbox provided by Müller et al. (2009a, b), which supports the execution of various subspace clustering algorithms, while S_FAD, SUBSCALE, and SUBSPACE_DE are implemented in MATLAB R2013a. The evaluation metrics, parameter settings, and datasets used for comparison are described in the following subsections.

3.1 Evaluation metrics

Performance of the proposed algorithm (S_FAD) against various subspace clustering algorithms is evaluated through classification-based (external) measures. The true cluster labels (T) for the data items in each dataset are known, and each algorithm predicts a label for each data item; the predicted and true labels form a confusion matrix from which the evaluation measures are derived. In this study, the measures used for testing the performance of S_FAD against the various subspace algorithms are rand_index and F_Score, from which average ranks and success rate ratio ranks are subsequently computed. The evaluation measures employed are briefly described below:

  • Rand index- The evaluation measure for determining the quality of the clusters formed by a clustering algorithm, defined as the ratio of correctly labeled data items to the total number of data items. The higher the rand_index, the better the algorithm: a good clustering algorithm predicts clusters that closely portray the true clusters and thus produces high-quality clusters.

  • F Score- This measure requires that a predicted cluster cover as many data items as possible from its true cluster and as few as possible from other clusters (Müller et al. 2009a, b). F_Score is expressed in Eq. (6):

    $$ F~{\text{score}} = \frac{1}{m}\mathop \sum \limits_{{i = 1}}^{m} \frac{{2*{\text{recall}}(T_{i} )*{\text{precision}}(T_{i} )}}{{{\text{recall}}(T_{i} ) + {\text{precision}}(T_{i} )}} $$
    (6)

    where ‘m’ is the number of true clusters. High precision corresponds to few items taken from other clusters, while high recall signifies high coverage of items from the true cluster. A high F_Score denotes good clustering quality.

  • Average ranks- Average ranking is a simple method for ranking algorithms, defined in Brazdil and Soares (2000). The rand_index and F_Score obtained by every algorithm on each dataset are sorted and assigned ranks, with F_Score and rand_index treated independently. The algorithm with the highest value is assigned rank 1, the second highest rank 2, and so on, for each dataset independently. The overall average rank of each algorithm is then computed as the mean of its ranks over all datasets. Let \(r_{j}^{i}\) be the rank of the jth algorithm on the ith dataset; the average rank of each algorithm over ‘n’ datasets is computed using Eq. (7) (a small numerical sketch of both ranking schemes follows this list):

    $$ r_{j} = \frac{{\mathop \sum \nolimits_{{i = 1}}^{n} r_{j}^{i} }}{n} $$
    (7)
  • Success Rate Ratio Ranks (SRR)- SRR is a ranking method in which the ratio of success rates is computed between pairs of algorithms (Brazdil and Soares 2000). It is useful for estimating the magnitude of the difference in rand_index (RI) obtained by the algorithms and for identifying significant differences; if the difference is not significant, the success rate ratio is close to 1. SRR ranking starts by taking one algorithm and one dataset at a time and calculating the ratio of its rand_index to that of each remaining algorithm, as in Eq. (8):

    $$ {\text{SRR}}_{{j,k~,j \ne k}}^{i} = \frac{{{\text{RI}}_{j}^{i} }}{{{\text{RI}}_{k}^{i} }} $$
    (8)

    where ‘i’ is the dataset, ‘j’ is the algorithm whose success rate is calculated, and ‘k’ is a compared algorithm different from ‘j’. In this way, the success rate ratio of algorithm ‘j’ with respect to algorithm ‘k’ on the ith dataset is computed. SRR is computed similarly on every dataset for the same pair of algorithms, and the values are then averaged over all datasets to obtain the overall SRR for the pair using Eq. (9):

    $$ {\text{SRR}}_{{j,k~,j \ne k}} = \frac{{\mathop \sum \nolimits_{{i = 1}}^{n} {\text{SRR}}_{{j,k~,j \ne k}}^{i} }}{n} $$
    (9)

    where ‘n’ is the number of datasets. In this way, the success rate of algorithm ‘j’ over algorithm ‘k’ is obtained, and it is computed likewise over each of the remaining algorithms. After these pairwise success rate ratios have been obtained, the mean success rate ratio of algorithm ‘j’ is calculated using Eq. (10):

    $$ {\text{SRR}}_{j} = \frac{{\mathop \sum \nolimits_{k} {\text{SRR}}_{{j,k~,j \ne k}} }}{{m - 1}} $$
    (10)

    where ‘m’ is the total number of compared subspace algorithms. In this way, the SRR of every algorithm over all datasets is computed, and the algorithms are ranked in descending order of SRR, since a higher rand_index or F_Score indicates a better algorithm.
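As a small numerical illustration of Eqs. (7)–(10), the snippet below computes average ranks and SRR values from a matrix of scores (rand_index or F_Score); the numbers are made up purely for illustration.

```python
import numpy as np
from scipy.stats import rankdata

# scores[i, j] = rand_index (or F_Score) of algorithm j on dataset i (illustrative values)
scores = np.array([[0.92, 0.88, 0.75],
                   [0.81, 0.85, 0.70],
                   [0.95, 0.90, 0.60]])

# Eq. (7): rank 1 for the highest score on each dataset, then average over datasets
avg_rank = rankdata(-scores, axis=1).mean(axis=0)

# Eqs. (8)-(10): pairwise ratios averaged over datasets, then over the other algorithms
n_data, n_alg = scores.shape
srr = np.array([
    np.mean([np.mean(scores[:, j] / scores[:, k]) for k in range(n_alg) if k != j])
    for j in range(n_alg)
])
print(avg_rank)   # lower average rank is better
print(srr)        # higher SRR is better
```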

3.2 Parameter tuning

Parameter values used in the S_FAD algorithm are shown in Table 1. These values were decided after repeating many experiments, and the best values were chosen on the basis of the algorithm’s performance.

Table 1 S_FAD parameter settings

Two further parameters used in the FAD algorithm are the population size and the number of iterations. They are tuned by repeating a set of experiments on the datasets, fixing one of the two and varying the other. Tables 2, 3, 4 and 5 report the results used to select the best population size and iteration count for the subsequent S_FAD experiments. The datasets used for parameter tuning are the artificial dataset ‘D50’ and the actual dataset ‘shape’, with 50 and 17 dimensions respectively.

Table 2 Varying population size with 10 iterations (for artificial dataset)
Table 3 Varying iterations with 40 population size (for artificial dataset)
Table 4 Varying population size with 10 iterations (for actual dataset)
Table 5 Varying iterations with 40 population size (for actual dataset)

Tables 2 and 3 correspond to parameter tuning on the artificial dataset. Table 2 shows that the algorithm achieves better rand_index and F_Score with a population size of 40. With the population size fixed at 40, Table 3 shows the effect of varying the number of iterations, and twenty iterations are selected for further experiments on the artificial dataset. For the actual dataset (the shape dataset), the effect of population size on S_FAD is shown in Table 4; the algorithm again performs best with a population size of 40. Adopting this population size, the variation over iterations is shown in Table 5, and twenty iterations are selected for further experiments on the actual dataset. For ease of implementation on higher-dimensional datasets (500 dimensions and above) (Bache and Lichman 2006), the population size is set to 10 with 20 iterations.

Parameter values of the other compared algorithms are obtained from their respective works: for the SUBSCALE algorithm they are taken from the original authors’ work (Kaur and Datta 2015), and for the remaining subspace clustering algorithms they are defined in Müller et al. (2009a, b).

3.3 Dataset description

To establish the efficacy of the S_FAD algorithm against various subspace clustering algorithms, it is evaluated on artificial and actual datasets, described in Tables 6 and 7. The artificial datasets used in this work have 10 to 75 dimensions and were also used in Müller et al. (2009a, b); Table 6 (also used in Agarwal and Mehta 2019b) gives their names and sizes. The real datasets are divided into two categories, small-dimensional and high-dimensional; they are standard datasets that various researchers have used to evaluate algorithm performance (Assent et al. 2008; Moise and Sander 2008; Sequeira and Zaki 2004). Table 7 lists the number of dimensions and instances of each actual dataset used in the experiments. The highest-dimensional dataset evaluated by S_FAD is DrivFace with 6400 attributes.

Table 6 Description of artificial datasets
Table 7 Description of actual datasets

4 Results and analysis

Thorough experiments are carried out to evaluate the performance of the proposed algorithm (S_FAD) on small- and high-dimensional datasets. S_FAD is given 30 independent runs on the small-dimensional datasets and the mean value is used for comparison. For the high-dimensional datasets (Madelon, Micromass, Gissette, and Drivface), 10 independent runs of S_FAD are considered owing to hardware limitations. The F_Score and rand_index of the various conventional algorithms, except SUBSCALE and SUBSPACE_DE, are obtained by re-running them on the extended WEKA framework provided by Müller et al. (2009a, b), while S_FAD, SUBSCALE, and SUBSPACE_DE are implemented in MATLAB.

To explore the experimental outcome, this section comprises two subsections: Sect. 4.1 compares the S_FAD algorithm with various conventional subspace algorithms on different datasets, and Sect. 4.2 presents the results of S_FAD on high-dimensional actual datasets.

4.1 Comparison of proposed algorithm (S_FAD) with conventional subspace clustering algorithms

Figures 4 and 5 depict the values obtained by S_FAD and the conventional subspace clustering algorithms for rand_index and F_Score respectively on the artificial datasets; the results of the conventional algorithms are adapted from Agarwal and Mehta (2019b) and Müller et al. (2009a, b). Figure 4 shows that S_FAD attains more than 90% accuracy on the majority of artificial datasets. On the small-dimensional D10 dataset, S_FAD performs better than all existing algorithms except SCHISM, behind which it lags by approximately 4%. On the D15 dataset, the percentage improvement of S_FAD over most algorithms is much larger than its deficit relative to DOC (approximately 5%) and SCHISM (approximately 4%). For D20, S_FAD lags INSCY by only 4% while outperforming all other algorithms. On the D25 dataset, S_FAD performs better than the compared algorithms except MINECLUS and SCHISM, behind which it lags by 8% and 2% respectively. On the 50- and 75-dimensional datasets, S_FAD outperforms all algorithms in terms of rand_index; SCHISM and INSCY achieve good rand_index on smaller datasets but fail to provide results for the 50- and 75-dimensional datasets. A similar trend is exhibited for F_Score, shown in Fig. 5.

Fig. 4
figure 4

Rand_index of algorithms on artificial datasets

Fig. 5
figure 5

F_Score of algorithms on artificial datasets

Figures 6 and 7 show the results of the proposed algorithm against the various conventional algorithms on the liver disorder, glass, diabetes, shape, breast, and vowel datasets in terms of rand_index and F_Score respectively. Figure 6 shows that the proposed algorithm (S_FAD) gives similar or better performance compared with the other subspace clustering algorithms. For the glass dataset, S_FAD lags CLIQUE, SUBCLU, and INSCY by 7.1% in total, while improving on the rest of the algorithms by 171.3% overall. On the diabetes dataset, S_FAD gives comparable performance, improving by 33.2% overall and lagging by 6.9%. For the liver disorder dataset, S_FAD outperforms all other subspace clustering algorithms in terms of rand_index. For the breast cancer and shape datasets, its performance is very similar to that of the other algorithms. On the vowel dataset, the total percentage improvement of S_FAD is much higher than its total percentage deficit (11% relative to CLIQUE, SCHISM, and INSCY). For the pendigits dataset, S_FAD remains close to the best-performing algorithm. These observations show that, although S_FAD does not outperform on every actual dataset, its average percentage improvement is much higher than its average percentage deficit; thus S_FAD performs above average on actual datasets. The analysis for F_Score is similar to that for rand_index.

Fig. 6
figure 6

Rand_index of algorithms on actual datasets

Fig. 7
figure 7

F_Score of algorithms on actual datasets

It can also be noticed that SUBCLU could not cope with the pendigits dataset, which has 7494 instances. Thus, in terms of F_Score and rand_index, S_FAD performs well compared with the other subspace clustering algorithms.

To further establish the results, two well-recognized ranking methods, average ranking and success rate ratio ranking, are applied to the F_Score and rand_index values (Brazdil and Soares 2000), and the statistical significance of the algorithms is assessed using the Wilcoxon signed-rank test. The remainder of this section is organized as follows: Sect. 4.1.1 gives the average ranking of the algorithms on artificial and actual datasets, Sect. 4.1.2 gives the success rate ratio ranking, Sect. 4.1.3 describes the Wilcoxon signed-rank test, and Sect. 4.1.4 illustrates the scalability of S_FAD with respect to data dimensionality.

4.1.1 Analysis on average ranking

The average ranks of S_FAD and the various existing subspace algorithms are computed independently on artificial and actual datasets based on F_Score and rand_index. Table 8 presents the average ranking of the subspace algorithms on the rand_index and F_Score values for the artificial datasets (depicted in Figs. 4 and 5). S_FAD attains the first rank on both rand_index and F_Score. SCHISM occupies the second rank in terms of rand_index, while DOC, which belongs to the cell-based category, occupies the second rank in terms of F_Score. It can be concluded that, after S_FAD, the cell-based algorithms show roughly similar performance on artificial datasets, while the clustering-oriented algorithms occupy ranks in the second half (i.e., below the fifth position).

Table 8 Average rank of algorithms on artificial datasets

Table 9 presents the average ranking of the subspace algorithms on the actual datasets with respect to rand_index and F_Score. The CLIQUE algorithm wins on rand_index but stands third from last on F_Score. S_FAD, in contrast, behaves very consistently and holds the second position for both rand_index and F_Score. MINECLUS takes the first and fifth positions with respect to F_Score and rand_index respectively. The two cell-based algorithms CLIQUE and MINECLUS thus show extreme behavior, whereas S_FAD is consistent. Apart from these two cell-based algorithms, none of the other subspace algorithms performs well on both rand_index and F_Score, so S_FAD shows considerably better results across both measures.

Table 9 Average rank of algorithms on actual datasets

4.1.2 Analysis of ranking on success rate ratios

Success rate ratio (SRR) ranks of S_FAD and the other existing subspace clustering algorithms are calculated on artificial and actual datasets. Table 10 gives the SRR rank of all subspace algorithms with respect to rand_index and F_Score on the artificial datasets. S_FAD wins among the various subspace algorithms and stands in the first position, with SUBSPACE_DE second for both rand_index and F_Score. These results indicate that density-based algorithms win over cell-based and clustering-oriented algorithms.

Table 10 SRR Rank of Algorithms on Artificial Datasets

Table 11 gives the ranking of the various subspace algorithms based on the success rate ratio on the actual datasets. S_FAD stands in the fifth and fourth positions for F_Score and rand_index respectively, whereas its average rank is 2. The reason for this difference between the average and SRR ranks is that S_FAD does not perform very well on a few low-dimensional datasets such as pendigits, shape, and vowel. Nevertheless, S_FAD gives consistent performance, as its ranks are approximately the same for rand_index and F_Score, whereas the performance of CLIQUE, MINECLUS, DOC, PROCLUS, and INSCY varies widely between the two measures. The CLIQUE algorithm holds the first position with the highest rand_index but at the same time gives a poor F_Score on actual datasets, whereas S_FAD performs comparably on both rand_index and F_Score.

Table 11 SRR Rank of Algorithms on Actual Datasets

The analysis also shows that the overall percentage improvement of S_FAD outweighs its percentage deficit relative to the other subspace clustering algorithms. Thus, it can be concluded that S_FAD presents considerably good efficacy, consistency, and reliability in terms of rand_index and F_Score for the majority of small-dimensional datasets. S_FAD is a hard subspace clustering algorithm that also performs clustering on high-dimensional datasets (shown in Sect. 4.2). The next section discusses the statistical significance of the obtained results using the Wilcoxon signed-rank test.

4.1.3 Statistical significance of results of proposed algorithm (S_FAD) versus other subspace clustering algorithms

To further strengthen the case for S_FAD over the conventional subspace clustering algorithms, a statistical hypothesis test is performed to establish the statistical significance of the experimental results. The test starts with a null hypothesis H0 and an alternative hypothesis H1, which state the following:

  • H0 = results of compared algorithms are statistically the same

  • H1 = results of compared algorithms are statistically not the same

For the problem at hand, a non-parametric test is suitable (Demšar 2006), as no assumptions are made about the population distribution. The Wilcoxon signed-rank test is used to determine whether the results obtained by the various subspace algorithms differ significantly on each dataset. The test ranks the absolute differences in rand_index and F_Score on the artificial and actual datasets and then compares the sums of ranks of the positive and negative differences. The significance level is set to 5%, i.e., the probability of rejecting the null hypothesis when it is true is 5%. In this work, the significance of the differences between the S_FAD algorithm and the other subspace algorithms is determined on the basis of rand_index and F_Score.
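Such a paired test can be carried out with SciPy as below; the paper compares the rank sums W+ and W− against tabulated critical values, while SciPy reports a p-value, but both lead to the same accept/reject decision at the 5% level. The paired scores here are placeholders, not values from the paper.

```python
from scipy.stats import wilcoxon

# placeholder paired F_Scores of S_FAD and one compared algorithm on the same datasets
s_fad    = [0.93, 0.88, 0.91, 0.85, 0.95, 0.90]
baseline = [0.80, 0.86, 0.70, 0.84, 0.75, 0.78]

stat, p_value = wilcoxon(s_fad, baseline)   # two-sided test on the paired differences
print(stat, p_value)
# H0 (the results are statistically the same) is rejected at the 5% level if p_value < 0.05
```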

4.1.3.1 Results of Wilcoxon signed-rank test on artificial datasets

Tables 12, 13, and 14 show the Wilcoxon signed-rank test of the proposed algorithm (S_FAD) versus the cell-based, density-based, and clustering-oriented subspace algorithms respectively on the artificial datasets. Table 12 shows that S_FAD differs significantly from SCHISM on F_Score. For the other comparisons with the cell-based algorithms, however, no significant difference (H0 true) is obtained on the artificial datasets, because the value of the statistic ‘W’ exceeds the critical value of the Wilcoxon signed-rank test.

Table 12 Wilcoxon signed-rank test of S_FAD vs. Cell-based
Table 13 Wilcoxon signed-rank test of S_FAD versus density-based
Table 14 Wilcoxon signed-rank test of S_FAD versus clustering oriented

Table 13 gives the outcomes of the Wilcoxon signed-rank test of S_FAD against the density-based algorithms. For the FIRES, SUBSCALE, and SUBSPACE_DE algorithms the statistic ‘W’ equals the critical value of 0, which means S_FAD outperforms them on all artificial datasets; hence it shows a significant difference with respect to these algorithms. There is no significant difference between S_FAD and the INSCY algorithm with respect to rand_index and F_Score, although S_FAD shows better results on the majority of datasets. Thus, it can be concluded that S_FAD shows overall better performance than the existing density-based algorithms. Table 14 presents the statistical test of S_FAD against the clustering-oriented subspace algorithms. S_FAD surpasses the clustering-oriented algorithms, i.e., P3C, PROCLUS, and STATPC, in terms of rand_index and F_Score; moreover, W− = 0, meaning that S_FAD outperforms them on all datasets and yields higher rand_index and F_Score than the clustering-oriented algorithms.

4.1.3.2 Results of Wilcoxon signed-rank test on actual datasets

Table 15 shows the Wilcoxon signed-rank test of S_FAD versus the cell-based algorithms on the actual datasets. Since seven actual datasets are considered and a two-tailed test is employed, the critical value (from the Wilcoxon signed-rank table) is 2. S_FAD shows a statistically significant difference from the CLIQUE algorithm based on F_Score; since W− = 0, S_FAD performs better on all datasets. For the rest of the cell-based algorithms, S_FAD shows no significant difference (H0 true).

Table 15 Wilcoxon signed-rank test of S_FAD versus cell-based

Table 16 gives the Wilcoxon signed-rank test of S_FAD against the density-based subspace algorithms on the actual datasets. S_FAD performs significantly better with respect to F_Score and rand_index than the FIRES, SUBCLU, SUBSCALE, and SUBSPACE_DE algorithms, while no statistically significant difference is found against INSCY. For the SUBCLU, SUBSCALE, and SUBSPACE_DE algorithms, W− = 0, which means S_FAD is superior on all datasets.

Table 16 Wilcoxon signed-rank test of S_FAD versus density-based

Table 17 presents the statistical comparison of S_FAD with the clustering-oriented subspace algorithms. S_FAD gives a statistically better rand_index than the P3C algorithm. However, the value of ‘W’ exceeds the critical value for PROCLUS and STATPC, so there is no significant difference in performance with respect to rand_index and F_Score. It is noticeable that W+ > W− even where no significant difference exists, which means S_FAD is still the better performer against the clustering-oriented algorithms. From the above discussion, it can be established that S_FAD performs considerably better than the majority of algorithms on most of the datasets.

Table 17 Wilcoxon signed-rank test of S_FAD versus clustering oriented

4.1.4 Algorithm’s scalability analysis

Scalability here refers to the performance of the algorithms as the dimensionality of the data increases. Scalability can be shown only on the artificial datasets, as these have the same number of instances but varying numbers of attributes; it is presented with respect to both rand_index and F_Score, plotted on the y-axis against data dimensionality on the x-axis. The scalability of S_FAD versus the cell-based algorithms is shown in Figs. 8 and 9. The proposed algorithm shows comparable performance on the 10- to 25-dimensional datasets and better performance on the 50- and 75-dimensional datasets, whereas DOC and MINECLUS show erratic behavior as dimensionality increases. Figures 10 and 11 show the scalability of S_FAD against the density-based algorithms: INSCY performs very well on the 25-dimensional dataset but cannot handle higher-dimensional datasets, and S_FAD takes the lead on the 50- and 75-dimensional datasets. The scalability of S_FAD versus the clustering-oriented algorithms is depicted in Figs. 12 and 13; S_FAD gives better results on almost all datasets than clustering-oriented algorithms such as STATPC, P3C, and PROCLUS.

Fig. 8
figure 8

Scalability of S_FAD versus Cell based on Rand_index

Fig. 9
figure 9

Scalability of S_FAD versus Cell based on F_Score

Fig. 10
figure 10

Scalability of S_FAD versus density based on Rand_index

Fig. 11
figure 11

Scalability of S_FAD versus density based on F_Score

Fig. 12
figure 12

Scalability of S_FAD versus clustering oriented on RI

Fig. 13
figure 13

Scalability of S_FAD versus clustering oriented on F_Score

Thus, S_FAD scales well with data dimensionality compared with the various subspace clustering algorithms. The next section illustrates the performance of S_FAD on very high-dimensional datasets.

4.2 S_FAD on high-dimensional actual dataset

To validate the efficacy of the proposed algorithm (S_FAD) on high-dimensional data, it is run on actual datasets with a large number of attributes, and it is found to form subspace clusters successfully on them. Tables 18 and 19 present the best (maximum), worst (minimum), mean (average), median, and standard deviation of the F_Score and rand_index over 10 independent runs. The other subspace clustering algorithms could not cope with these high-dimensional datasets, and results of other clustering algorithms are not available for them; the SUBSCALE algorithm (Kaur and Datta 2015) was attempted on the MADELON dataset, but exact result values are not revealed in that work. The actual datasets included in this study are MADELON, MICROMASS, GISSETTE, and DRIV FACE with 500, 1300, 5000, and 6400 dimensions respectively.

Table 18 F_Score of S_FAD on high-dimensional actual datasets
Table 19 Rand_index of S_FAD on high-dimensional actual datasets

Tables 18 and 19 show that the proposed algorithm (S_FAD) achieves a high rand_index on the MADELON, MICROMASS, and DRIV FACE datasets. In the case of the GISSETTE dataset, the algorithm’s efficacy is lower because the dataset is sparse. Since the standard deviation is very low, the algorithm does not require many independent runs to obtain good results.

4.3 Discussion

S_FAD is assessed against 11 subspace clustering algorithms on a total of 13 actual and artificial datasets with respect to rand_index and F_Score. On small dimensions, the total percentage improvement of S_FAD is higher than its total percentage lag compared with the other algorithms. The subspace algorithms are also ranked by average ranking and SRR ranking on artificial and actual datasets independently, and the Wilcoxon signed-rank test is performed to validate the significance of the differences in the results. From the experimental results and analysis, it is inferred that S_FAD performs better in terms of F_Score and rand_index than most of the existing subspace algorithms on the majority of datasets. Additionally, S_FAD is executed on high-dimensional actual datasets, and the results show that it scales very well on such high-dimensional, thin datasets; as dataset size increases, measures such as sampling would need to be incorporated to maintain efficacy. The time complexity of the S_FAD algorithm is \(O(2^{m} \cdot n\log n)\), where m is the number of dimensions and n is the number of records in the dataset, so there is a trade-off between rand_index and time complexity. Although the time complexity of S_FAD is relatively high, it provides near-optimal solutions on high-dimensional problems, whereas traditional subspace clustering algorithms fail to provide results for high-dimensional data. S_FAD also finds overlapping subspace clusters with no redundant information, a property hardly satisfied by any other subspace clustering algorithm, and it can additionally determine subspace clusters of varying densities.

Hence, conventional subspace clustering algorithms are better suited to applications with small dimensionality and time-sensitive requirements, whereas for high-dimensional applications S_FAD is the better choice.

5 Conclusion

Subspace clustering of data with a large number of attributes is a computational challenge in the data mining field: the number of subspaces and the dimensions involved in each subspace are unknown before clustering, and the number of possible subspaces grows exponentially with dimensionality. To resolve these issues, a novel subspace clustering algorithm, S_FAD, is proposed and evaluated in terms of F_Score and rand_index. Using these measures, the proposed algorithm is compared with various conventional subspace clustering algorithms on the basis of four criteria: average ranking, success rate ratio ranking, the Wilcoxon signed-rank test, and scalability with dimensionality. S_FAD provides considerably good performance under all of these criteria. It gives no redundant subspace information, as it performs clustering in maximal subspaces, it takes input data in their original form without normalization, and it determines overlapping subspace clusters of varying densities. Thus, S_FAD offers several advantages over existing subspace clustering algorithms and successfully determines clusters in a 6400-dimensional actual dataset. In future work, the S_FAD algorithm can be applied to distributed database applications such as vertical and horizontal fragmentation, and further techniques can be incorporated to cluster large datasets and improve clustering quality.