1 Introduction

With the rapid growth of data on different storage media, there is a natural need for effective methods of accessing data and extracting useful knowledge. Data mining provides some of the most effective methods in this field. Data mining is an iterative knowledge discovery process that is carried out both manually and automatically, and it searches for valuable and new information in very large volumes of data (Gnanapriya et al. 2010).

Description and prediction are the main aims of data mining. The first category describes the attributes of a dataset and focuses on finding patterns in the data that can be interpreted by humans. The second category is based on inference from the data, looking for unknown variables and values (Mining 2006). Each of these categories includes different pattern types, such as frequent pattern mining, classification and regression, clustering and outlier detection, each of which has its own applications and features. The aim of this study is to investigate clustering analysis, which belongs to the descriptive patterns, with regard to the type of data used for data mining (Pujari 2001).

Clustering groups the data so as to maximize the similarity between samples within a cluster while minimizing the similarity between samples of different clusters (Berkhin 2006). Clustering greatly helps the discovery of unknown patterns in data and has a wide range of applications in fields including Web searching, security and business intelligence (Vercellis 2011; Han et al. 2011). Data mining, as a powerful and advanced technology in business intelligence, enables companies to focus on the important data in their data warehouses and helps them make effective knowledge-based decisions that increase business profit using data mining tools (Hema and Malik 2010).

Multidimensional data analysis is one of the most important factors in improving the efficiency and speed of data mining in business intelligence. In this study, datasets organized as data cubes are used for clustering; this type of data allows analysis along several dimensions. In the following, some related work on the data cube is reviewed.

Data warehouses, which contain data collected from multiple sources and organized around a specific subject, enable wide-ranging analysis; managers perform this complex analysis using OLAP tools (Hema and Malik 2010). Data warehouses and OLAP tools are based on a multidimensional data model; therefore, the data cube is the natural concept for modeling data in several dimensions, in which data are represented by dimensions and facts. In addition, OLAP operations can be used on the data cube to create views, run interactive queries and perform data analysis (Chaudhuri and Dayal 1997). Liço (2017) introduces OLAP as the main component of business intelligence and the data cube as OLAP’s core component, and regards the data cube as the most powerful tool for large databases. That study introduces an intelligent cube to reduce system response time and also applies compression techniques to reduce storage space.

A clustering algorithm for modeling the data cube and collecting information from cuboids was introduced by Woo et al. (2015); in that study, the values of specific attributes form a large data stream, cuboids are used to store different parts of the stream data, and clustering is carried out on this type of data. Ceci et al. (2015) studied a hierarchy-based clustering algorithm over continuous data, aiming at applications such as wireless sensor networks.

In the current research, data cube clustering is considered in order to provide efficient unsupervised analysis of the data. The challenge of data cube clustering is the presence of irregular and specific data, especially when the data size is large, which makes clustering difficult. There are several approaches to clustering (Joshi and Kaur 2013), namely the partitioning, hierarchical, density-based and grid-based methods, and among these four approaches only the density-based approach is able to identify nonconvex clusters.

Therefore, in order to achieve higher efficiency, we use a density-based clustering method. Among density-based methods, DBSCAN is the most widely used. DBSCAN owes its popularity to its simplicity of implementation and its ability to recognize clusters of different sizes and nonconvex shapes (Berkhin 2006; Cheng 2017). Hence, the DBSCAN algorithm is selected for density-based clustering in the current research. DBSCAN is a very good candidate for finding nonconvex clusters in the data space (Kumar and Reddy 2016). The challenge of DBSCAN clustering is the dependence of the resulting clusters on its parameters, namely the neighborhood radius and the minimum number of points, which are chosen empirically according to the type of data. Fine-tuning these parameters therefore plays a significant role in identifying proper clusters.

Several studies have tried to improve DBSCAN. In Smiti and Eloudi (2013), fuzzy set theory was applied to design a fuzzy clustering that improves DBSCAN, which the authors called Soft DBSCAN; it assigns initial membership degrees to the data to express the proximity of data entities to the cluster centers. A graph-based index structure (Kumar and Reddy 2016) was proposed to improve the performance of DBSCAN on high-dimensional datasets by accelerating the neighbor search operations. A new measurement criterion (Cheng 2017) was used to obtain a distance calculated from the threshold analysis of the nearest neighbor against all neighbors. Darong and Peng (2012) combined a partitioning technique with DBSCAN to obtain proper input parameters for DBSCAN; however, the effectiveness of this method was not evaluated on datasets with different densities. A combination of the Gaussian-means method with DBSCAN (Smiti and Eloudi 2012) was proposed to improve the determination of the DBSCAN parameters; however, Gaussian-means creates circular, non-density-based clusters and does not perform well on dense data. DBSCAN was also combined with binary differential evolution (Karami and Johansson 2014) to determine its parameters. Recently, many meta-heuristic algorithms have been proposed to improve various clustering algorithms by reducing their sensitivity to the algorithms’ important parameters (Chen 2012; Pei et al. 2008; Zhao 2007; Aydilek and Arslan 2013).

Among these works, however, DBSCAN has not yet been improved as a density-based clustering with a meta-heuristic such as the earthworm optimization algorithm (EWOA) (Wang et al. 2018). Therefore, in this paper the EWOA is used to identify the best parameters for DBSCAN. The proposed clustering algorithm, called the improved DBSCAN (EWOA–DBSCAN), hybridizes DBSCAN with the EWOA on the 2D data obtained from the data cube through the new dimension move and similarity metric.

The EWOA’s own challenge is the tuning of its Reproduction 1 and Reproduction 2 parameters, which are also determined empirically and have a significant impact on the efficiency and convergence of the algorithm. There have been several attempts to tune the parameters of other meta-heuristics dynamically, such as proposing a diversity measure between the GA’s chromosomes in the population, designing a variation that depends on the fitness values of the GA’s chromosomes and designing fuzzy logic controllers (FLCs) (Angelova and Pencheva 2011; Karafotias et al. 2015), but this kind of parameter tuning has not been considered for the EWOA. An FLC is constructed from a knowledge base, which consists of the information given by linguistic control rules, a fuzzification interface, an inference system and a defuzzification interface (Liu and Lampinen 2005; Herrera and Lozano 2003). Therefore, in order to modify EWOA–DBSCAN, another data cube clustering algorithm is proposed that attempts to improve the selection of the effective EWOA parameters by using a new, simple FLC based on two groups of linguistic control rules, namely Mamdani’s rules and Takagi–Sugeno’s rules. We name the second algorithm the “soft improved DBSCAN” (EWOA–DBSCAN-Mamdani and EWOA–DBSCAN-Sugeno). The proposed ideas and their achievements will be used to design 3D clustering in our future work.

The rest of the paper is organized as follows. In Sect. 2, we describe the preprocessing of the data cube, covering the data cube structure, the assignment of an address to each cube cell, the design of a dimension move that creates related 2D data from the data cube and the proposal of a new similarity metric. The proposed clustering algorithms are explained in Sect. 3, which covers the three investigated evaluation indices, the DBSCAN algorithm and the improved and soft improved DBSCAN algorithms. The experimental results are presented in Sect. 4. Finally, the conclusion and future work are given in Sect. 5.

2 Preprocessing of data cube

In order to smoothly explain the proposed clustering algorithms and demonstrate their scalability, this section first describes, in four steps, the data cube structure, the normalization of the data cube, the dimension move and the new similarity metric. To clarify these descriptions, a sample data cube of size \( 5 \times 4 \times 3 \) is designed, as shown in Fig. 1. Its values are the sales of three well-known hypermarket chains in Iran, namely Refah (Ref), Etkah (Etk) and Shahrvand (Sha). The sales are recorded in five Iranian cities, namely Tehran (Teh), Shiraz (Shi), Mashhad (Mash), Tabriz (Tab) and Esfahan (Esf), and each city’s hypermarket sales are measured in four seasons, namely winter, spring, summer and fall. It is assumed that the cities are represented by X (\( \left| X \right| = 5 \)), the seasons by Y (\( \left| Y \right| = 4 \)) and the hypermarkets by Z (\( \left| Z \right| = 3 \)). The steps of the preprocessing are described as follows.

Fig. 1 A data cube’s structure, dimension separation and dimension move

2.1 Data cube structure

A conventional dataset is a two-dimensional array of values in which each row is called an object and each column an attribute. A data cube, in contrast, is a multidimensional array of values with more than two dimensions, where each dimension corresponds to an attribute. The data cube investigated in the current research is a 3D array of values, denoted \( {\text{cube(}}x,y,z )= {\text{value}} \). In Fig. 1, there are three attributes, namely city (x), season (y) and hypermarket (z); for example, \( {\text{cube}}(3,2,1) = 163 \) represents 163 sales units of the Shahrvand hypermarket in Mashhad in spring. In this research, an address is assigned to each cube cell based on the following function:

$$ V_{r} : r = \left| {Z \times Y} \right| \times (i - 1) + \left| Y \right| \times (k - 1) + j $$
(1)

where \( i \), \( j \) and \( k \) represent the cell indices for \( X \), \( Y \) and \( Z \), respectively. For example, in Fig. 1, \( v_{42} \) is the address of \( {\text{cube}}(4,2,2) = 600 \). Also, the above function has the following inverse function:

$$ i = \left\{ {\begin{array}{*{20}l} {{\text{div}}\left( {r, \left| {Z \times Y} \right|} \right),} \hfill & {{\text{if}}\;\bmod \left( {r,\left| {Z \times Y} \right|} \right) = 0} \hfill \\ {{\text{div}}\left( {r, \left| {Z \times Y} \right|} \right) + 1,} \hfill & {\text{otherwise}} \hfill \\ \end{array} } \right. $$
(2)
$$ k = \left\{ {\begin{array}{*{20}l} {{\text{div}}\left( {\left( {r - \left| {Z \times Y} \right| \times (i - 1)} \right),\left| Y \right|} \right),} \hfill & {{\text{if}}\; {\text{mod}}\left( {\left( {r - \left| {Z \times Y} \right| \times (i - 1)} \right),\left| Y \right|} \right) = 0} \hfill \\ {{\text{div}}\left( {\left( {r - \left| {Z \times Y} \right| \times (i - 1)} \right), \left| Y \right|} \right) + 1,} \hfill & {{\text{otherwise}} } \hfill \\ \end{array} } \right. $$
(3)
$$ j = r - \left| {Z \times Y} \right| \times (i - 1) - \left| Y \right| \times (k - 1) $$
(4)

The address function and its inverse create a two-way relationship between the 3D and 2D search spaces, which is needed because the proposed data cube clustering converts the 3D space of the data cube into a two-dimensional space by moving one dimension along the other dimensions using the dimension move algorithm (Algorithm 1).
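To make the mapping concrete, the sketch below implements the address function (1) and its inverse (2)–(4). The paper’s experiments are implemented in MATLAB; the Python function names used here are illustrative assumptions, not part of the paper.

```python
# Illustrative sketch of the cell-address mapping (1) and its inverse (2)-(4);
# indices are 1-based as in the paper, and the function names are assumptions.

def cell_to_address(i, j, k, size_y, size_z):
    """Map cube cell (i, j, k) to its address r, as in Eq. (1)."""
    return size_z * size_y * (i - 1) + size_y * (k - 1) + j

def address_to_cell(r, size_y, size_z):
    """Recover (i, j, k) from address r, as in Eqs. (2)-(4)."""
    zy = size_z * size_y
    i = r // zy if r % zy == 0 else r // zy + 1
    rem = r - zy * (i - 1)
    k = rem // size_y if rem % size_y == 0 else rem // size_y + 1
    j = rem - size_y * (k - 1)
    return i, j, k

# Example from Fig. 1: cube(4, 2, 2) has address v_42 when |Y| = 4 and |Z| = 3.
assert cell_to_address(4, 2, 2, size_y=4, size_z=3) == 42
assert address_to_cell(42, size_y=4, size_z=3) == (4, 2, 2)
```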

2.2 Normalizing data cube

The 3D attributes of the data have different scales, and normalizing the attributes removes the effect of larger-scale attributes on smaller-scale ones. The min–max normalization (Han et al. 2011) is a linear converter of \( v_{i} \in A \) into \( v^{\prime }_{i} \in A^{\prime } \) with the new bounds \( \left[ {{\text{new}}\_{ \hbox{min} }_{{A^{\prime } }} ,{\text{new}}\_{ \hbox{max} }_{{A^{\prime } }} } \right] \). The min–max normalization was proposed for 2D data; a 3D version of the normalization (see Fig. 1), which linearly converts the values into \( [0,1] \), is designed in the current research and is applied layer by layer, from the first layer to the \( \left| Z \right| \)th layer, as follows:

$$ \left\{ {\begin{array}{*{20}c} {{\text{cube}}^{\prime } (i,j,1) = \frac{{{\text{cube}}(i,j,1) - \mathop {\hbox{min} }\limits_{{\begin{array}{*{20}c} {1 \le i \le \left| X \right|} \\ {1 \le j \le \left| Y \right|} \\ \end{array} }} {\text{cube(}}i,j,1 )}}{{\mathop {\hbox{max} }\limits_{{\begin{array}{*{20}c} {1 \le i \le \left| X \right|} \\ {1 \le j \le \left| Y \right|} \\ \end{array} }} {\text{cube(}}i,j,1 )- \mathop {\hbox{min} }\limits_{{\begin{array}{*{20}c} {1 \le i \le \left| X \right|} \\ {1 \le j \le \left| Y \right|} \\ \end{array} }} {\text{cube(}}i,j,1 )}}} \\ \vdots \\ {{\text{cube}}^{\prime } \left( {i,j,\left| Z \right|} \right) = \frac{{{\text{cube}}\left( {i,j,\left| Z \right|} \right) - \mathop {\hbox{min} }\limits_{{\begin{array}{*{20}c} {1 \le i \le \left| X \right|} \\ {1 \le j \le \left| Y \right|} \\ \end{array} }} {\text{cube}}\left( {i,j,\left| Z \right|} \right)}}{{\mathop {\hbox{max} }\limits_{{\begin{array}{*{20}c} {1 \le i \le \left| X \right|} \\ {1 \le j \le \left| Y \right|} \\ \end{array} }} {\text{cube}}\left( {i,j,\left| Z \right|} \right) - \mathop {\hbox{min} }\limits_{{\begin{array}{*{20}c} {1 \le i \le \left| X \right|} \\ {1 \le j \le \left| Y \right|} \\ \end{array} }} {\text{cube}}\left( {i,j,\left| Z \right|} \right)}}} \\ \end{array} } \right. $$
(5)

where \( i = 1, \ldots ,\left| X \right| \) and \( j = 1, \ldots ,\left| Y \right| \).
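As a concrete illustration of (5), the sketch below normalizes each Z-layer of a cube independently into [0, 1]; it assumes a NumPy array indexed as cube[i, j, k] and is not the paper’s MATLAB code.

```python
# A minimal sketch of the layer-wise min-max normalization (5), assuming each
# Z-layer contains at least two distinct values so the denominator is nonzero.
import numpy as np

def normalize_cube(cube):
    """Scale every X-Y layer of the cube independently into [0, 1], as in Eq. (5)."""
    cube = cube.astype(float)
    out = np.empty_like(cube)
    for k in range(cube.shape[2]):              # one 2D layer per value of Z
        layer = cube[:, :, k]
        lo, hi = layer.min(), layer.max()
        out[:, :, k] = (layer - lo) / (hi - lo)
    return out
```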

2.3 Dimension move

There are techniques for reshaping data cubes and 3D matrices, such as those of Scovanner et al. (2007), Zhao and Yang (2015) and Johnson et al. (2013), which have different levels of complexity and are applied in image processing. In the current research, a dimension move is designed to convert a 3D matrix into a 2D matrix (see Fig. 1). To this end, first, sub-2D matrices of the 3D matrix are obtained by dimension separation along a given dimension; for example, three sub-2D matrices (\( X \times Y \)) are separated along Z in Fig. 1. Second, the 2D matrices are joined together along one of their dimensions. In Fig. 1, the separated 2D matrices are joined along Y, which creates a 2D matrix with \( \left| X \right| \) rows and \( \left| Y \right| \times \left| Z \right| \) columns. The procedure of the dimension move is presented in Algorithm 1.

Algorithm 1 Dimension move
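Since Algorithm 1 appears only as a figure in the original, the following is a minimal sketch of the dimension move under the assumption that the cube is a NumPy array indexed as cube[i, j, k] and that the slices are taken along Z in index order, as suggested by Fig. 1.

```python
# A sketch of the dimension move: split the cube into |Z| slices of size |X| x |Y|
# and join them along Y, giving a 2D matrix with |X| rows and |Y|*|Z| columns.
# The slicing order along Z is an assumption based on Fig. 1.
import numpy as np

def dimension_move(cube):
    slices = [cube[:, :, k] for k in range(cube.shape[2])]
    return np.hstack(slices)

# For the 5 x 4 x 3 sample cube of Fig. 1 this yields a 5 x 12 matrix, in which
# cell (i, j, k) of the cube occupies row i and column (k - 1) * |Y| + j (1-based),
# consistent with the address function (1).
```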

2.4 New similarity metric

As mentioned, each cell of a data cube is specified by four values, as in \( {\text{Cube(}}i,j,k )= {\text{value}} \), where \( i,j,k \) are the dimension values of X (\( X{\text{Cube}} \)), Y (\( Y{\text{Cube}} \)) and Z (\( Z{\text{Cube}} \)), respectively. Because the 2D data obtained from a given data cube consist of quantitative values for both their rows and their column labels, a new similarity metric is proposed to calculate the distance between rows A and B of the 2D data. The metric’s procedure is presented in Algorithm 2. The similarity metric is used to calculate the dissimilarity of two rows in the DBSCAN clustering procedure (Algorithm 3) and in the investigation evaluation indices between the samples of each cluster [(7) and (8)].

Algorithm 2 New similarity metric

Here, \( \delta \) is an effectiveness parameter for comparing the quantitative values of \( X{\text{Cube}} \), \( Y{\text{Cube}} \) and \( Z{\text{Cube}} \). The parameter is tuned in the range \( \left[ {0.2,0.4} \right] \), since the 3D matrix has been normalized.

3 The proposed clustering algorithms

With regard to the proposed preprocessing, the data cube with its 3D matrix structure was converted into related 2D data by the dimension move (Algorithm 1), and the distance between rows A and B of the 2D data was calculated by the new similarity metric (Algorithm 2). Now, two improved variants of density-based clustering are proposed to solve data cube clustering. In the next subsections, first, the three investigation evaluation indices are explained; second, the DBSCAN algorithm is presented, its main challenge is discussed and novel strategies are proposed to improve it.

3.1 The investigation evaluation indices

Three investigation evaluation indices, namely the Davies–Bouldin index (DBI) (Davies and Bouldin 1979), the Dunn index (DI) (Bezdek and Pal 1995) and the Silhouette index (SI) (Rousseeuw 1987), are considered to evaluate the clusters obtained with DBSCAN. The DBI combines the within-cluster and between-cluster distances. The best clustering corresponds to the minimum DBI, and the index is formulated as follows:

$$ {\text{DBI}} = \frac{1}{N}\mathop \sum \limits_{i = 1}^{N} \left( {\mathop {\hbox{max} }\limits_{{\begin{array}{*{20}c} {j = 1, \ldots ,N} \\ {j \ne i} \\ \end{array} }} \left( {\frac{{S_{i} + S_{j} }}{{d_{ij} }}} \right)} \right) $$
(6)

where \( N \) is the number of clusters and \( d_{ij} \) is the average linkage, used as the between-cluster distance of clusters \( C_{i} \) and \( C_{j} \). Also, \( S_{i} \) and \( S_{j} \) are the average within-cluster distances of \( C_{i} \) and \( C_{j} \), respectively.

$$ d_{ij} = \frac{{\mathop \sum \nolimits_{{p_{r} \in C_{i} ,p_{s} \in C_{j} }} \left\| {p_{r} - p_{s} } \right\|}}{{\left| {C_{i} } \right|\left| {C_{j} } \right|}} $$
(7)
$$ S_{i} = \frac{{\mathop \sum \nolimits_{{p_{r} ,p_{s} \in C_{i} }} \left\| {p_{r} - p_{s} } \right\|}}{{\left| {C_{i} } \right|\left( {\left| {C_{i} } \right| - 1} \right)}} $$
(8)

where \( \left\| \cdot \right\| \) is Algorithm 2 and \( p_{r} \in C_{i} \) means point \( r \) belongs to the cluster \( i \).

The DI focuses on the compactness of clusters that are well separated from the others, measured by the cluster diameter and the inter-cluster distance, respectively. It is the minimum inter-cluster distance divided by the maximum cluster diameter. The best clustering corresponds to the maximum DI, and the index is formulated as follows:

$$ {\text{DI}} = \frac{{\mathop {\hbox{min} }\nolimits_{{\begin{array}{*{20}c} {i,j = 1, \ldots ,N} \\ {j \ne i} \\ \end{array} }} \left( {D_{ij} } \right)}}{{\mathop {\hbox{max} }\nolimits_{k = 1, \ldots ,N} \left( {{\text{diam}}_{k} } \right)}} $$
(9)
$$ D_{ij} = \mathop {\hbox{min} }\limits_{{p_{r} \in C_{i} ,p_{s} \in C_{j} }} \left\| {p_{r} - p_{s} } \right\| $$
(10)
$$ {\text{diam}}_{k} = \mathop {\hbox{max} }\limits_{{p,p' \in C_{k} }} \left\| {p - p^{\prime } } \right\| $$
(11)

where \( \left\| \cdot \right\| \) is Algorithm 2, \( D_{ij} \) is the inter-cluster distance between clusters \( C_{i} \) and \( C_{j} \) and \( {\text{diam}}_{k} \) is the diameter of cluster \( C_{k} \).

The SI measures the consistency within clusters by comparing the similarity of an object to its own cluster with its similarity to other clusters. The index is calculated for a data point \( p_{i} \) in cluster \( C_{i} \) and ranges from − 1 to + 1, where values close to 1 show that the object is well matched to its own cluster and poorly matched to neighboring clusters. So, the best clustering corresponds to an SI close to 1, and the average SI over all data points is formulated as follows:

$$ {\text{SI}} = \frac{1}{{\left| {\text{Data}} \right|}}\mathop \sum \limits_{i = 1}^{{\left| {\text{Data}} \right|}} \left( {\frac{{b\left( {p_{i} } \right) - a\left( {p_{i} } \right)}}{{\hbox{max} \left\{ {a\left( {p_{i} } \right),b\left( {p_{i} } \right)} \right\}}}} \right) $$
(12)
$$ b\left( {p_{i} } \right) = \mathop {\hbox{min} }\limits_{{\begin{array}{*{20}c} {k \ne i} \\ {k = 1, \ldots ,N} \\ \end{array} }} \frac{1}{{\left| {C_{k} } \right|}}\mathop \sum \limits_{{p_{j} \in C_{k} }} \left\| {p_{i} - p_{j} } \right\| $$
(13)
$$ a\left( {p_{i} } \right) = \frac{1}{{\left| {C_{i} } \right| - 1}}\mathop \sum \limits_{{p_{j} \in C_{i} ,\;p_{j} \ne p_{i} }} \left\| {p_{i} - p_{j} } \right\| $$
(14)

where \( \left\| \cdot \right\| \) is Algorithm 2 and |Data| and \( \left| {C_{k} } \right| \) are the number of all data points and the number of data points in cluster k, respectively. \( a\left( {p_{i} } \right) \) is the mean distance between \( p_{i} \) and all other data points in the same cluster, and \( b\left( {p_{i} } \right) \) is the mean dissimilarity of \( p_{i} \) to the other clusters.
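For reference, the three indices can be computed as in the sketch below; the Euclidean distance is used only as a placeholder for Algorithm 2, the helper names are assumptions, and each cluster is assumed to contain at least two points.

```python
# Illustrative implementations of the DBI (6)-(8), DI (9)-(11) and SI (12)-(14);
# `X` holds one row per data point, `labels` holds integer cluster labels and
# `dist` stands in for the paper's similarity metric (Algorithm 2).
import numpy as np

def _pairwise(X, dist):
    n = len(X)
    D = np.zeros((n, n))
    for r in range(n):
        for s in range(r + 1, n):
            D[r, s] = D[s, r] = dist(X[r], X[s])
    return D

def davies_bouldin(X, labels, dist=lambda a, b: float(np.linalg.norm(a - b))):
    """DBI of Eq. (6): lower is better."""
    D = _pairwise(X, dist)
    clusters = [np.where(labels == c)[0] for c in np.unique(labels)]
    # within-cluster averages S_i of Eq. (8); assumes every cluster has >= 2 points
    S = [D[np.ix_(c, c)].sum() / (len(c) * (len(c) - 1)) for c in clusters]
    total = 0.0
    for i, ci in enumerate(clusters):
        ratios = [(S[i] + S[j]) / D[np.ix_(ci, cj)].mean()   # d_ij of Eq. (7)
                  for j, cj in enumerate(clusters) if j != i]
        total += max(ratios)
    return total / len(clusters)

def dunn(X, labels, dist=lambda a, b: float(np.linalg.norm(a - b))):
    """DI of Eq. (9): higher is better."""
    D = _pairwise(X, dist)
    clusters = [np.where(labels == c)[0] for c in np.unique(labels)]
    diam = max(D[np.ix_(c, c)].max() for c in clusters)       # Eq. (11)
    sep = min(D[np.ix_(ci, cj)].min()                         # Eq. (10)
              for i, ci in enumerate(clusters)
              for j, cj in enumerate(clusters) if i < j)
    return sep / diam

def silhouette(X, labels, dist=lambda a, b: float(np.linalg.norm(a - b))):
    """SI of Eq. (12): values close to 1 indicate well-matched clusters."""
    D = _pairwise(X, dist)
    clusters = {c: np.where(labels == c)[0] for c in np.unique(labels)}
    scores = []
    for p in range(len(X)):
        own = clusters[labels[p]]
        a = D[p, own[own != p]].mean()                        # Eq. (14)
        b = min(D[p, idx].mean()                              # Eq. (13)
                for c, idx in clusters.items() if c != labels[p])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```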

3.2 DBSCAN algorithm

DBSCAN (Han et al. 2011) is a clustering method based on data density whose brief procedure is presented in Algorithm 3. Two parameters, namely the neighborhood radius (ε) and the minimum number of points (MinPts, µ) needed to form a cluster, are used to evaluate the density of the distributed points. The algorithm starts from an arbitrary point and counts the points located within its neighborhood radius, i.e., at a distance of less than ε. If the number of such points exceeds µ, they form a cluster; otherwise, the point is temporarily labeled as an outlier. In a later step, this point may still be recognized as part of a cluster. The advantage of this method is its ability to distinguish and separate the outlier data from the other data.

Algorithm 3 DBSCAN clustering
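Algorithm 3 is given as a figure in the paper; the sketch below is a standard DBSCAN implementation with a pluggable distance, so that either the Euclidean metric (DBSCAN1) or the proposed similarity metric (DBSCAN2) can be supplied. The scan order and helper names are assumptions.

```python
# A compact DBSCAN sketch: labels of -1 mark outliers, 0..N-1 mark clusters.
import numpy as np

def dbscan(X, eps, min_pts, dist=lambda a, b: float(np.linalg.norm(a - b))):
    n = len(X)
    labels = np.full(n, -1)                  # every point starts as an outlier
    visited = np.zeros(n, dtype=bool)
    cluster = 0

    def region(p):                           # indices within radius eps of point p
        return [q for q in range(n) if dist(X[p], X[q]) <= eps]

    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        neighbours = region(p)
        if len(neighbours) < min_pts:
            continue                         # p remains an outlier for now
        labels[p] = cluster
        seeds = list(neighbours)
        while seeds:                         # grow the cluster from its core points
            q = seeds.pop()
            if not visited[q]:
                visited[q] = True
                q_neigh = region(q)
                if len(q_neigh) >= min_pts:
                    seeds.extend(q_neigh)
            if labels[q] == -1:
                labels[q] = cluster
        cluster += 1
    return labels
```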

In this algorithm, the most important task is to find proper values of \( \varepsilon \) and \( \mu \). These values are commonly found by statistical and classical methods that combine different data mining techniques, which in many cases consume considerable time without achieving high precision. Therefore, in this research, we use the earthworm optimization algorithm (EWOA) (Wang et al. 2018), a meta-heuristic algorithm, to estimate suitable values of these parameters and achieve significant improvements.

3.3 The improved DBSCAN

To design the improved DBSCAN (EWOA–DBSCAN), the EWOA is adapted to find the optimum values of P, µ and ε in Algorithm 3. In the adapted EWOA, for an obtained 2D dataset (Algorithm 1) with |Data| objects/points and M attributes, each earthworm is an (M + 2)-dimensional array as in (15). The first M elements represent the initial point P at which DBSCAN starts, the (M + 1)th element represents the neighborhood radius (ε) and the last element represents the value of MinPts (µ).

$$ {\text{EW}}_{i} = \left\{ {\begin{array}{*{20}l} {{ \hbox{min} }_{ij} \le {\text{ew}}_{ij} \le { \hbox{max} }_{ij} ,} \hfill & {{\text{if}}\;1 \le j \le M} \hfill \\ {{\text{dis}}_{ \hbox{min} } \le {\text{ew}}_{ij} \le {\text{dis}}_{ \hbox{max} } ,} \hfill & {{\text{if}}\; j = M + 1} \hfill \\ {2 \le {\text{ew}}_{ij} , } \hfill & {{\text{if}}\;j = M + 2} \hfill \\ \end{array} } \right. $$
(15)
$$ { \hbox{min} }_{ij} = \mathop {\hbox{min} }\limits_{{1 \le r \le \left| {\text{Data}} \right|}} 2D\_{\text{Data}}\left( {r,j} \right) $$
(16)
$$ { \hbox{max} }_{ij} = \mathop {\hbox{max} }\limits_{{1 \le r \le \left| {\text{Data}} \right|}} 2D\_{\text{Data}}\left( {r,j} \right) $$
(17)
$$ {\text{dis}}_{ \hbox{min} } = \mathop {\hbox{min} }\limits_{{\begin{array}{*{20}c} {1 \le i,r \le \left| {\text{Data}} \right|} \\ {i \ne r} \\ \end{array} }} \left\| {p_{i} - p_{r} } \right\| $$
(18)
$$ {\text{dis}}_{ \hbox{max} } = \mathop {\hbox{max} }\limits_{{\begin{array}{*{20}c} {1 \le i,r \le \left| {\text{Data}} \right|} \\ {i \ne r} \\ \end{array} }} \left\| {p_{i} - p_{r} } \right\| $$
(19)

where \( i = 1, \ldots ,{\text{pop}}\_{\text{size}} \), \( { \hbox{min} }_{ij} \) and \( { \hbox{max} }_{ij} \) are the minimum and the maximum values of the \( j{\text{th}} \) column of \( 2D\_{\text{Data}} \), and \( {\text{dis}}_{ \hbox{min} } \) and \( {\text{dis}}_{ \hbox{max} } \) are the minimum and the maximum distances between objects/points. Note that \( { \hbox{min} }_{ij} = 0 \) and \( { \hbox{max} }_{ij} = 1 \) for normalized 2D data.

The EWOA’s inputs are the population size (\( {\text{pop}}\_{\text{size}} \)), the similarity factor (\( \alpha \)), the number of kept earthworms (\( n{\text{KEW}} \)), the maximum number of generations (\( {\text{MaxGen}} \)) and/or other termination criteria (Carvalho and Freitas 2004; Nagar and Srivastava 2008; Huang 1997; Freitas 2003). In Algorithm 4, a brief procedure of the EWOA is presented based on Initialization, Reproduction 1, Reproduction 2 and Selection.

Initialization: With regard to the earthworm structure (15), \( {\text{pop}}\_{\text{size}} \) earthworms \( {\text{EW}}_{i}^{0} = \left( {{\text{ew}}_{i1} , \ldots ,{\text{ew}}_{i,M + 2} } \right) \) are generated randomly as the initial population of generation zero, \( {\text{EW(}}0 )= \left\{ {{\text{EW}}_{1}^{0} , \ldots ,{\text{EW}}_{{{\text{pop}}\_{\text{size}}}}^{0} } \right\} \). To evaluate each earthworm, Algorithm 3 is run and the DBI (6) is calculated as its fitness. The worms of \( {\text{EW}}\left( 0 \right) \) are sorted in ascending order, and the best one is saved as \( {\text{EW}}_{\text{best}} \).
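A sketch of the encoding (15) and the initialization step is given below; it assumes a normalized 2D dataset (so min_ij = 0 and max_ij = 1), reuses the dbscan and davies_bouldin sketches above as the fitness evaluation, and its function names are assumptions.

```python
# Earthworm encoding and initialization: each worm is [p_1..p_M, eps, MinPts].
import numpy as np

def init_population(data_2d, pop_size, dis_min, dis_max, rng=np.random.default_rng(0)):
    n, m = data_2d.shape
    pop = np.empty((pop_size, m + 2))
    pop[:, :m] = rng.uniform(0.0, 1.0, size=(pop_size, m))    # initial point P
    pop[:, m] = rng.uniform(dis_min, dis_max, size=pop_size)  # neighborhood radius eps
    pop[:, m + 1] = rng.integers(2, n, size=pop_size)         # MinPts >= 2
    return pop

def fitness(worm, data_2d, dist):
    """Run DBSCAN with the worm's (eps, MinPts) and score the result by the DBI (6).
    The first M genes encode the start point P; this simplified sketch does not use them."""
    m = data_2d.shape[1]
    labels = dbscan(data_2d, eps=worm[m], min_pts=int(worm[m + 1]), dist=dist)
    members = labels >= 0
    if len(np.unique(labels[members])) < 2:
        return np.inf                                          # degenerate clustering
    return davies_bouldin(data_2d[members], labels[members], dist)
```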

Reproduction 1: The EWOA includes two strategies to generate new offspring, Reproduction 1 and Reproduction 2. The behavior of Reproduction 1 is similar to mutation in a GA, and this operation is run first. Reproduction 1 affects only the \( \varepsilon \) and \( \mu \) values; a worm (\( {\text{EW}}_{i} \)) generates a new offspring (\( {\text{EW}}_{i}^{\text{new}} \)) by the following formula and saves it in \( R1\left( G \right) \):

$$ {\text{EW}}_{i}^{\text{new}} = \left\{ {\begin{array}{*{20}l} {{\text{ew}}_{i,j} ,} \hfill & {{\text{if}}\; 1 \le j \le M} \hfill \\ {{\text{dis}}_{ \hbox{max} } + {\text{dis}}_{ \hbox{min} } - \alpha \times {\text{ew}}_{i,j} , } \hfill & {{\text{if}}\; j = M + 1} \hfill \\ {2 \le {\text{ew}}_{i,j} + \left( {U\left( { - \alpha ,\alpha } \right) \times \left| {\text{Data}} \right| } \right),} \hfill & {{\text{if}}\; j = M + 2} \hfill \\ \end{array} } \right. $$
(20)

where \( \alpha \in \left[ {0,1} \right] \) is a similarity factor that adjusts the distance between the parent earthworm (\( {\text{EW}}_{i} \)) and its offspring (\( {\text{EW}}_{i}^{\text{new}} \)); decreasing \( \alpha \) reduces this distance. \( U\left( { - \alpha ,\alpha } \right) \) generates a random real value in \( \left[ { - \alpha ,\alpha } \right] \).
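The sketch below applies (20) to a single worm; clipping MinPts at 2, per the constraint in (15), is an assumption about how the bound is enforced.

```python
# Reproduction 1 (Eq. (20)): keep the point part, reflect the radius inside
# [dis_min, dis_max] and randomly perturb MinPts.
import numpy as np

def reproduction1(worm, m, dis_min, dis_max, alpha, n_data, rng=np.random.default_rng()):
    child = worm.copy()
    child[m] = dis_max + dis_min - alpha * worm[m]
    child[m + 1] = max(2.0, worm[m + 1] + rng.uniform(-alpha, alpha) * n_data)
    return child
```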

Reproduction 2: In Reproduction 2, different strategies are used to generate offspring depending on whether the earthworm’s index is larger than \( n{\text{KEW}} \) or not. In the adapted version of the EWOA for the current research, if the index is larger than \( n{\text{KEW}} \) (\( i > n{\text{KEW}} \)), an earthworm \( {\text{EW}}_{S} \) (\( i \ne S \)) is selected from \( {\text{EW(}}G ) \) by the Roulette wheel. Then, the scattered crossover (Fig. 2) is applied to \( {\text{EW}}_{S} \) and \( {\text{EW}}_{i} \) to generate two new offspring (\( {\text{EW}}_{1}^{\text{new}} \) and \( {\text{EW}}_{2}^{\text{new}} \)), which are saved in \( R2\left( G \right) \). If the index is not larger than \( n{\text{KEW}} \) (\( i \le n{\text{KEW}} \)), the earthworm (\( {\text{EW}}_{i} \)) is treated as an elite worm and is stored in \( R2\left( G \right) \) after changing its \( \varepsilon \) and \( \mu \) values as follows:

Fig. 2 Scattered crossover procedure

$$ {\text{EW}}_{i}^{\text{new}} = \left\{ {\begin{array}{*{20}l} {{\text{ew}}_{i,j} , } \hfill & {{\text{if}}\;1 \le j \le M} \hfill \\ {{\text{dis}}_{ \hbox{max} } + {\text{dis}}_{ \hbox{min} } - \frac{\alpha }{2} \times {\text{ew}}_{i,j} , } \hfill & { {\text{if}}\; j = M + 1} \hfill \\ {2 \le {\text{ew}}_{i,j} + \left( {U\left( {\frac{ - \alpha }{2},\frac{\alpha }{2}} \right) \times \left| {\text{Data}} \right| } \right),} \hfill & {{\text{if}}\;j = M + 2} \hfill \\ \end{array} } \right. $$
(21)

Selection: Generation G ends with the earthworms \( {\text{EW}}\left( G \right) \cup R1\left( G \right) \cup R2\left( G \right) \), and the offspring worms in \( R1\left( G \right) \cup R2\left( G \right) \) are evaluated by running Algorithm 3 and (6). To reduce the number of worms to \( {\text{pop}}\_{\text{size}} \) for the next generation (\( {\text{EW}}\left( {G + 1} \right) \)), the worms of \( {\text{EW}}\left( G \right) \cup R1\left( G \right) \cup R2\left( G \right) \) are sorted in ascending order and the top \( {\text{pop}}\_{\text{size}} \) worms are kept. Also, the best one is saved as \( {\text{EW}}_{\text{best}} \).
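The following sketch collects Reproduction 2 and the selection step. Worms ranked within the first nKEW positions are treated as elites and perturbed as in (21); the others are recombined with a roulette-wheel partner by the scattered crossover of Fig. 2, here realized as a random binary mask choosing each gene from one of the two parents. The details of the roulette wheel and the ranking convention are assumptions.

```python
# Reproduction 2 and survivor selection for the adapted EWOA.
import numpy as np

def scattered_crossover(parent_a, parent_b, rng=np.random.default_rng()):
    """Scattered crossover (Fig. 2): a random mask swaps genes between two parents."""
    mask = rng.random(len(parent_a)) < 0.5
    return np.where(mask, parent_a, parent_b), np.where(mask, parent_b, parent_a)

def elite_update(worm, m, dis_min, dis_max, alpha, n_data, rng=np.random.default_rng()):
    """Elite perturbation of Eq. (21), applied when the worm's rank is within nKEW."""
    child = worm.copy()
    child[m] = dis_max + dis_min - (alpha / 2) * worm[m]
    child[m + 1] = max(2.0, worm[m + 1] + rng.uniform(-alpha / 2, alpha / 2) * n_data)
    return child

def select_next_generation(pool, pool_fitness, pop_size):
    """Keep the pop_size best worms of EW(G) U R1(G) U R2(G), ranked by ascending DBI."""
    pool_fitness = np.asarray(pool_fitness)
    order = np.argsort(pool_fitness)
    return pool[order[:pop_size]], pool_fitness[order[0]]      # survivors and best DBI
```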

The challenge for the EWOA is determining the Reproduction 1 (\( \alpha \)) and Reproduction 2 (\( n{\text{KEW}} \)) parameters. These parameters are set empirically and have a significant impact on the efficiency, accuracy and speed of the algorithm.

3.4 The soft improved DBSCAN algorithm

To address this challenge and design the soft improved DBSCAN (EWOA–DBSCAN–FLC), a new self-adaptive EWOA based on a new fuzzy logic controller (FLC) is designed in this subsection.

Algorithm 5 Soft improved DBSCAN (EWOA–DBSCAN–FLC)

Algorithm 5 follows the same procedure as Algorithm 4, except that \( \alpha \) and \( n{\text{KEW}} \) are calculated by the proposed FLC (see Fig. 3) from the inputs \( {\text{UG}} \) and \( {\text{fitBest}} \). The FLC’s inputs are normalized values derived from the current generation and the evaluation function (6), computed as follows:

Fig. 3 Proposed fuzzy logic controller based on two inputs \( \left( {{\text{UG}}, {\text{fitBest}}} \right) \) and two outputs \( \left( {\alpha ,\;n{\text{KEW}}} \right) \)

$$ {\text{UG}} = \frac{G}{{{\text{Max}}G}} $$
(22)
$$ {\text{fitBest}} = \frac{{{\text{DBI}}\left( { {\text{EW}}_{\text{best}} } \right)}}{{{ \hbox{min} }\left( {{\text{DBI}}\left( {{\text{EW}}_{S} } \right),{\text{DBI}}\left( {{\text{EW}}_{i} } \right)} \right)}} $$
(23)

The FLC consists of fuzzification of the inputs, a linguistic logic strategy (LLS) and defuzzification of the outputs. The inputs are fuzzified with the membership functions presented in Fig. 4. The LLS includes two main parts: a rule base and an inference engine. There are two groups of rules, called Mamdani’s rules and Takagi–Sugeno’s rules, because the FLC is designed to generate dynamic outputs based on Mamdani’s inference (Mamdani and Assilian 1975) and Takagi–Sugeno’s inference (Takagi and Sugeno 1993). The LLS creates the output surfaces of Figs. 5 and 6 using Mamdani’s rules and Takagi–Sugeno’s rules, respectively. The maximum and minimum operations are used for the “OR” and “AND” operators in the LLS’s inference engine for aggregation and reasoning. The center of gravity is used as the defuzzification method.
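Because the paper reports the rule bases only as the surfaces of Figs. 5 and 6, the sketch below is a speculative zero-order Takagi–Sugeno controller: UG and fitBest are fuzzified with triangular low/medium/high memberships, and the outputs are weighted averages of assumed rule consequents. The breakpoints and consequent values are illustrative assumptions, not the paper’s rules.

```python
# A speculative Takagi-Sugeno-style controller mapping (UG, fitBest) to (alpha, nKEW).
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function with feet at a and c and peak at b."""
    return max(0.0, min((x - a) / (b - a) if x <= b else (c - x) / (c - b), 1.0))

def fuzzify(x):
    return {"low": tri(x, -0.5, 0.0, 0.5),
            "medium": tri(x, 0.0, 0.5, 1.0),
            "high": tri(x, 0.5, 1.0, 1.5)}

def flc(ug, fit_best, pop_size):
    """Early generations favour exploration (large alpha, few kept elites); late,
    well-converged generations favour exploitation (small alpha, more kept elites)."""
    mu_g, mu_f = fuzzify(ug), fuzzify(fit_best)
    # assumed consequents (alpha, fraction of kept worms) per (UG, fitBest) rule
    rules = {("low", "low"): (0.9, 0.10),      ("low", "medium"): (0.85, 0.15),
             ("low", "high"): (0.8, 0.20),     ("medium", "low"): (0.7, 0.20),
             ("medium", "medium"): (0.6, 0.25), ("medium", "high"): (0.5, 0.30),
             ("high", "low"): (0.4, 0.30),     ("high", "medium"): (0.3, 0.35),
             ("high", "high"): (0.2, 0.40)}
    num_a = num_k = den = 0.0
    for (g, f), (a_out, k_out) in rules.items():
        w = min(mu_g[g], mu_f[f])            # "AND" realized as the minimum operator
        num_a += w * a_out
        num_k += w * k_out
        den += w
    if den == 0.0:
        return 0.8, max(1, int(0.2 * pop_size))     # fall back to the static tuning
    return num_a / den, max(1, round(num_k / den * pop_size))
```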

Fig. 4 Membership functions of two inputs \( \left( {{\text{UG}}, {\text{fitBest}}} \right) \) based on linguistic values of low, medium and high

Fig. 5 Outputs’ surfaces of the LLS based on Mamdani’s rules

Fig. 6 Outputs’ surfaces of the LLS based on Takagi–Sugeno’s rules

4 Evaluation of the improved DBSCAN algorithm and the soft improved DBSCAN

To evaluate the effectiveness of the proposed algorithms, experiments were performed on an Intel Core i5 3.2 GHz CPU with 4.00 GB of memory. The algorithms were implemented in MATLAB 2017a. Six benchmark data cube datasets, which are available from the UCI repository and considered for experimentation, are shown in Table 1.

Table 1 Investigated data cube

To establish the scalability of the proposed data cube clustering algorithms, each cell \( \left( {i,j,k} \right) \) of the investigated 3D benchmarks is assigned an address r based on (1), and the addresses are mapped back to \( \left( {i,j,k} \right) \) by the inverse functions (2), (3) and (4). The data are normalized based on (5), and the related 2D data are obtained by the dimension move (Algorithm 1). To analyze DBSCAN2, which uses the proposed similarity metric (Algorithm 2) in the DBSCAN clustering procedure (Algorithm 3) and in the investigation evaluation indices, DBSCAN1 is considered as a baseline that uses the Euclidean metric in Algorithm 3 and in the indices.

Eight variants of the proposed clustering algorithms are performed, namely DBSCAN1, DBSCAN2, EWOA–DBSCAN1, EWOA–DBSCAN2, EWOA–DBSCAN1-Mamdani, EWOA–DBSCAN2-Mamdani, EWOA–DBSCAN1-Sugeno and EWOA–DBSCAN2-Sugeno. There are eight parameters in the experimentation, namely µ and ε in Algorithm 3, \( {\text{pop}}\_{\text{size}} \), \( \alpha \), \( n{\text{KEW}} \) and \( {\text{Max}}G \) in Algorithm 4 and \( {\text{pop}}\_{\text{size}} \) and \( {\text{Max}}G \) in Algorithm 5. The parameters of Algorithm 3 are set to ε = 0.5 and µ equal to 10% of the investigated data, because other values increased the DBI and the number of clusters simultaneously. Also, Algorithm 4 was tested on the “Daily Demand Forecasting Orders” data cube with different values of \( \alpha \) and \( n{\text{KEW}} \); then, \( \alpha = 0.8 \) and \( n{\text{KEW}} = \left[ {0.2 \times {\text{pop}}\_{\text{size}}} \right] \), which gave the best results, were used for the experiments on all data cubes. Finally, \( {\text{pop}}\_{\text{size}} \) and \( {\text{Max}}G \) were set to 100 earthworms and 100 generations.

In order to compare the behavior of the EWOA with other meta-heuristics in improving the designed DBSCAN, a genetic algorithm (GA) with the scattered crossover (Fig. 2), single-point mutation, \( P_{\text{c}} = 0.8 \) (crossover rate), \( P_{\text{m}} = 0.02 \) (mutation rate), \( {\text{pop}}_{\text{size}} = 100 \) and \( {\text{MaxItr}} = 100 \) is performed and called GA-DBSCAN1. Also, \( P_{\text{c}} \) and \( P_{\text{m}} \) are dynamically tuned by the FLC (Fig. 3) in GA-DBSCAN1-Mamdani and GA-DBSCAN1-Sugeno.

The algorithms are run 20 times on each data cube, and the best obtained DBI (6), DI (9) and SI (12) are then reported as the best clustering quality of the data. The details of the obtained results are shown in “Appendix” and Tables 6, 7, 8, 9, 10, 11, 12 and 13. The best obtained DBI (6), DI (9) and SI (12) are summarized in Tables 2, 3 and 4, respectively. With regard to these tables, the performance of EWOA–DBSCAN2-Sugeno (Algorithm 5 based on Takagi–Sugeno’s rules) is significantly superior to that of the other performed algorithms.

Table 2 Best results (DBI) after 20 runs
Table 3 Best results (Dunn index) after 20 runs
Table 4 Best results (Silhouette index) after 20 runs

Also, comparing the results of DBSCAN1 and DBSCAN2 shows that, except for the Dow Jones Index dataset for the DBI and the ADL Recognition dataset for the Dunn and Silhouette indices, DBSCAN2 obtains results on the other datasets that are better by 3.6–25.5% in Table 2, 1.6–33.6% in Table 3 and 1.5–188.3% in Table 4. So, the newly proposed similarity metric is more efficient than the Euclidean metric.

To compare the functionality of the proposed data cube clustering algorithms, the convergence curves of the best DBI, DI and SI on the datasets are shown in Figs. 7, 8, 9, 10, 11 and 12. The horizontal axis of the figures is the number of generations, from 1 to 100, and the vertical axis is the best DBI, DI and SI found so far. Based on the figures, the EWOA has more control over the search space than the GA. Although the EWOA and the GA have almost the same evolutionary structure, the EWOA generates more offspring than the GA in each generation, which enhances the search process and increases the diversity of search points.

Fig. 7 Convergence curves for the Daily Demand Forecasting Orders based on three investigation evaluation indices

Fig. 8 Convergence curves for the Istanbul Stock Exchange based on three investigation evaluation indices

Fig. 9 Convergence curves for the Dow Jones Index based on three investigation evaluation indices

Fig. 10 Convergence curves for the ADL Recognition based on three investigation evaluation indices

Fig. 11 Convergence curves for the Software Engineering Teamwork based on three investigation evaluation indices

Fig. 12 Convergence curves for the User Identification from Walking Activity based on three investigation evaluation indices

To evaluate the significance level of the comparisons of the proposed data cube clustering algorithms, a hypothesis test is performed on the difference in the resulting quality between the algorithms. Because the results of each algorithm are not normally distributed, a nonparametric hypothesis test, the Wilcoxon signed-rank test, is applied in SPSS to each pair of samples at a significance level of α = 0.05. The results are presented in the form [Z, P] in Table 5. If the P value < 0.05, the null hypothesis (that there is no significant difference between the two paired samples) can be rejected at the 95% level, but if the P value > 0.05, the null hypothesis cannot be rejected. Therefore, the bold P values indicate that the comparison of the two mentioned clustering algorithms on the related datasets is significant at the 95% level.
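The paper runs the test in SPSS; for readers working in Python, the equivalent paired test is available in SciPy, as sketched below on two hypothetical vectors of per-run index values (placeholders, not the paper’s results).

```python
# Wilcoxon signed-rank test between the 20 paired runs of two algorithms on one dataset.
from scipy.stats import wilcoxon

algo_a = [0.41, 0.39, 0.44, 0.40, 0.38, 0.42, 0.43, 0.39, 0.40, 0.41,
          0.42, 0.40, 0.39, 0.44, 0.41, 0.38, 0.43, 0.40, 0.39, 0.42]  # e.g. DBI per run
algo_b = [0.36, 0.35, 0.38, 0.34, 0.33, 0.37, 0.36, 0.35, 0.34, 0.36,
          0.37, 0.35, 0.34, 0.38, 0.36, 0.33, 0.37, 0.35, 0.34, 0.36]

stat, p_value = wilcoxon(algo_a, algo_b)
# Reject the null hypothesis of no difference at the 95% level when p_value < 0.05.
print(f"statistic = {stat:.3f}, P value = {p_value:.4f}")
```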

Table 5 Results of the Wilcoxon signed-rank test in the form of [Z, P]

5 Conclusion and discussion

This paper focuses on data cube clustering, and the DBSCAN algorithm is considered as the basic clustering technique because it successfully recognizes nonconvex clusters and does not require a given number of clusters as an input. Since the efficiency of the DBSCAN algorithm is highly dependent on the neighborhood radius and the minimum number of neighbors, this study tries to improve the algorithm.

For this purpose, new techniques were first proposed in the preprocessing section: assigning a unique address (1) to each cube cell to clarify and support the scalability of the proposed data cube clustering algorithms, obtaining the related 2D data from the 3D data by the dimension move (Algorithm 1) and designing a similarity metric (Algorithm 2). DBSCAN1 and DBSCAN2 were then performed by embedding the Euclidean and the new similarity metrics in Algorithm 3, respectively.

The EWOA was adapted to improve DBSCAN (Algorithm 4) and find the optimum values of P, µ and ε in Algorithm 3 (EWOA–DBSCAN1 and EWOA–DBSCAN2). To address the clustering algorithms’ remaining challenge and enhance their exploration and exploitation capabilities, a new FLC (Fig. 3) was designed to dynamically tune the EWOA’s parameters (Algorithm 5), and EWOA–DBSCAN1-Mamdani, EWOA–DBSCAN2-Mamdani, EWOA–DBSCAN1-Sugeno and EWOA–DBSCAN2-Sugeno were performed to analyze the efficiency of the proposed ideas. The designed FLC was implemented with both Mamdani’s rules and Takagi–Sugeno’s rules.

For comparison of the obtained results, the GA was embedded in Algorithms 4 and 5, and GA-DBSCAN1, GA-DBSCAN1-Mamdani and GA-DBSCAN1-Sugeno were included in the experiments. Three investigation evaluation indices were considered to measure the similarity among members of a cluster and, at the same time, the separation of clusters, which corresponds to low similarity with the other clusters.

To evaluate and compare the proposed clustering algorithms, six data cube datasets were considered, and the details of the obtained results are reported in “Appendix”. All experiments indicated the efficiency and improvement of DBSCAN2 compared to DBSCAN1 and of EWOA–DBSCAN2-Sugeno compared to the others. In addition, the P values of the Wilcoxon signed-rank test were calculated to assess the significance level of the comparisons of the proposed algorithms. With regard to Table 5, the comparisons of most of the algorithms were significant at the 95% level for all datasets.