
1 Introduction

Outliers have a significant negative impact on data quality. According to Hawkins (1980), an outlier is an observation that deviates so much from the other observations in the data set as to arouse suspicion that it was generated by a different mechanism than the other data entities.

Detection of outliers is a fundamental issue in data analysis; its main goal is to detect and remove anomalous objects from the data. Because technology changes rapidly, the number of databases and their sizes grow over time, which makes manual inspection infeasible and automatic detection methods increasingly important.

There are many different methods for outlier detection that can be used, from the simplest ones to more complex ones such as feature bagging, subsampling, rotated bagging, isolation forests, the OutRank method, and the approach proposed by Nguyen et al. (2010) (see, for example, Aggarwal 2013, 2015; Aggarwal and Sathe 2017).

Outlier detection algorithms are usually seen as statistical models of the data that allow the identification of objects that do not fit the model, whereas distance-based approaches measure the distances between data points. In the latter case, outliers are the points whose distance to the rest of the data exceeds a given threshold (see, for example, Aggarwal 2013, 2015; Zhang 2013). One such distance-based algorithm is DBSCAN (density-based spatial clustering of applications with noise).

Ester et al. (1996) proposed a density-based algorithm for discovering clusters in classical data. The algorithm groups together points that have many neighbors and marks as outliers the points left in low-density regions. In 1998, Sander et al. proposed a generalized version of DBSCAN, the GDBSCAN, which can cluster point objects as well as spatially extended objects according to both their spatial and non-spatial attributes (Sander et al. 1998). Campello et al. (2013) proposed a hierarchical version of DBSCAN. In 1999, Ankerst et al. proposed the OPTICS algorithm, which extends the ideas of DBSCAN (Ankerst et al. 1999). Other DBSCAN extensions are SUBCLU and PreDeCon (Jahirabadkar and Kulkarni 2013; Kailing et al. 2004); both apply subspace clustering ideas similar to those of DBSCAN. WaveCluster (Sheikholeslami et al. 1998) is another modification of DBSCAN that applies the wavelet transform to the feature space, but it is applicable only to low-dimensional data sets. DENCLUE (Hinneburg and Keim 1998) is another efficient algorithm that uses density information to cluster objects.

This paper presents an application of ensemble learning for symbolic data as a tool for outlier detection, where the DBSCAN algorithm is applied. It also analyzes how DBSCAN's initial parameters affect the number of detected outliers and the clustering quality itself. It is also the first paper that deals with the problem of outlier detection for symbolic unbalanced data sets with the application of DBSCAN.

In the empirical part of this paper, ensemble learning and single DBSCAN models with different distance measures are used to detect outliers in unbalanced data sets. The paper also presents the impact of the parameters that are essential for DBSCAN on outlier detection and partition quality (in terms of the Silhouette index).

2 DBSCAN Ensemble for Symbolic Data

In classical data analysis, each object is described by a set of single-valued variables. This allows objects to be described as vectors of quantitative or qualitative measurements, where each column represents a variable. However, the classical approach may be too restrictive to represent more complex data. In order to take the uncertainty and variability of the data into consideration, we must allow sets of categories or intervals, possibly with frequencies or weights. This kind of data representation has been studied in Symbolic Data Analysis (SDA).

In symbolic data analysis, each symbolic object can be described by the following types of variables (Bock and Diday 2000; Billard and Diday 2006; Diday and Noirhomme-Fraiture 2008; Noirhomme-Fraiture and Brito 2011):

  1. Quantitative (numerical) variables:

     • numerical single-valued,

     • numerical multi-valued,

     • interval variables,

     • histogram variables.

  2. Qualitative (categorical) variables:

     • categorical single-valued,

     • categorical multi-valued,

     • categorical modal.

Examples of symbolic variables with their realizations are presented in Table 1.

Table 1 Examples of symbolic variables with realizations

As the objects in symbolic data analysis are described by non-classical variables, any type of phenomenon can be described in a more detailed way. However, symbolic data representation requires applying special distance measures, methods, and algorithms that can deal with complex data.
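For illustration, interval-valued symbolic data can be held in plain R structures. The sketch below uses hypothetical car models with made-up intervals; each interval variable is stored as a pair of min/max columns, and the interval lengths \(\left|v\right|\) used by the distance measures in the next section reduce to simple subtraction.

```r
# A minimal sketch of interval-valued symbolic data in base R:
# one row per object, each symbolic variable stored as [min, max].
cars <- data.frame(
  speed_min = c(160, 150, 180),  # hypothetical speed intervals (km/h)
  speed_max = c(220, 195, 250),
  price_min = c(20, 15, 45),     # hypothetical price intervals (kEUR)
  price_max = c(35, 25, 80),
  row.names = c("model_A", "model_B", "model_C")
)

# The length |v| of an interval realization is simply max - min;
# this quantity appears in the symbolic distance measures below.
cars$speed_max - cars$speed_min
#> [1] 60 45 70
```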

The DBSCAN algorithm, which will be used in the empirical part of this paper, has several advantages over traditional partitioning techniques, such as the ability to detect non-spherical cluster shapes and groups of different sizes, as well as robustness against outliers. However, the DBSCAN algorithm can sometimes produce a large number of clusters, and the interpretation of such results can be difficult. Some authors suggest using clustering visualization methods as support for DBSCAN (Nowak-Brzezińska and Xięski 2014).

The DBSCAN algorithm for symbolic data requires selecting two initial parameters: \(\varepsilon\) and minPts. The \(\varepsilon\) parameter controls how similar the objects in the same group must be. The minPts parameter is the minimal number of objects that are needed to form a cluster. The minPts value can be derived from the number of dimensions D in the data set as minPts ≥ D + 1. If minPts = 1, every data point will form its own cluster. When minPts ≤ 2, the results will be the same as for hierarchical clustering with the single-link metric, with the dendrogram cut at height ε. Therefore, minPts should be at least 3. Larger minPts values are useful for noisy data sets and for larger data sets. In general, minPts should be equal to or greater than the data dimensionality (number of variables); Sander et al. (1998) suggest using a minPts value that is twice the number of variables.

The ε value can be found by using a k-distance graph (see Sander et al. 1998 for further details), that is, by plotting the sorted distances of every point to its k-th nearest neighbor with k = minPts and reading ε off at the "elbow" of the curve. If ε is too small, a large part of the data set will not be clustered; if it is too large, almost all data points will be merged into a single cluster.
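As a sketch, this heuristic can be reproduced with the R package dbscan, assuming a version of the package that accepts precomputed dist objects; dist_sym below is a placeholder for a symbolic distance matrix, and the marked candidate value is purely illustrative.

```r
library(dbscan)  # provides kNNdistplot() and dbscan()

# dist_sym: a precomputed "dist" object holding a symbolic distance
# (e.g., the normalized Ichino-Yaguchi distance); assumed to exist.
minPts <- 5

# Plot the sorted distances of every object to its k-th nearest
# neighbor, k = minPts; eps is read off at the "elbow" of the curve.
kNNdistplot(dist_sym, k = minPts)
abline(h = 0.15, lty = 2)  # 0.15 is an illustrative candidate, not a rule
```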

There are also other proposals in the literature that deal with the problem of parameter selection for DBSCAN. Some of them use differential evolution; some propose detecting sharp increases in a function that computes the distance between each element of a data set and its k-th nearest neighbor; others propose using some other clustering algorithm as the initial tool (see, for example, Starczewski et al. 2020; Karami and Johansson 2014; Chen et al. 2019).

The DBSCAN algorithm also needs a suitable distance measure for symbolic data. In this paper, three such measures will be compared (Gatnar and Walesiak 2011); a computational sketch of the first one is given after the list:

  1. Normalized Ichino-Yaguchi distance (unweighted):

    $$d\left({A}_{i},{A}_{k}\right)=\sqrt[q]{\sum_{j=1}^{m}\psi {\left({v}_{ij},{v}_{kj}\right)}^{q}},$$
    (1)

where \(\psi \left({v}_{ij},{v}_{kj}\right)=\frac{\varphi \left({v}_{ij},{v}_{kj}\right)}{|{V}_{j}|}\) with \(\varphi \left({v}_{ij},{v}_{kj}\right)=\left|{v}_{ij}\oplus {v}_{kj}\right|-\left|{v}_{ij}\otimes {v}_{kj}\right|+\gamma \left(2\cdot \left|{v}_{ij}\oplus {v}_{kj}\right|-\left|{v}_{ij}\right|-\left|{v}_{kj}\right|\right)\), \({v}_{ij}, {v}_{kj}\)—realizations of the j-th symbolic variable for objects i and k, \(\oplus\)—Cartesian sum, \(\otimes\)—Cartesian product, \(\left|\cdot \right|\)—length of a symbolic interval-valued variable or the number of elements of a symbolic categorical multi-valued variable, \({V}_{j}\)—domain of the j-th symbolic variable, \(\gamma \in \left[0,\frac{1}{2}\right]\).

  2. Normalized de Carvalho distance based on description potential:

    $$d\left({A}_{i},{A}_{k}\right)=\frac{\left[\pi \left({A}_{i}\oplus {A}_{k}\right)-\pi \left({A}_{i}\otimes {A}_{k}\right)+\gamma \left(2\pi \left({A}_{i}\oplus {A}_{k}\right)-\pi \left({A}_{i}\right)-\pi \left({A}_{k}\right)\right)\right]}{\pi \left({A}^{E}\right)},$$
    (2)

    where \({A}^{E}\)—the maximum object according to the description potential, \(\pi \left({A}_{i}\right)\)—the description potential of a symbolic object; the other elements are as in Eq. (1).

  3. Second normalized de Carvalho distance based on description potential:

    $$d\left({A}_{i},{A}_{k}\right)=\frac{\left[\pi \left({A}_{i}\oplus {A}_{k}\right)-\pi \left({A}_{i}\otimes {A}_{k}\right)+\gamma \left(2\pi \left({A}_{i}\oplus {A}_{k}\right)-\pi \left({A}_{i}\right)-\pi \left({A}_{k}\right)\right)\right]}{\pi \left({A}_{i}\oplus {A}_{k}\right)},$$
    (3)

where all elements are as in Eqs. (1) and (2).

Other distances for symbolic data are described in Bock and Diday (2000) and Gatnar and Walesiak (2011).
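In the experiments, all distance computations are delegated to the symbolicDA package (mentioned below). Purely as an illustration of the arithmetic in Eq. (1), the following sketch computes the normalized Ichino-Yaguchi distance for two hypothetical objects described by interval-valued variables only; categorical multi-valued variables would need set operations instead.

```r
# Sketch of Eq. (1) restricted to interval-valued variables.
# x, y: matrices with one row per variable and columns (min, max);
# domain_len: lengths |V_j| of the variables' domains.
ichino_yaguchi <- function(x, y, domain_len, gamma = 0.5, q = 2) {
  # Cartesian sum (join): smallest interval covering both realizations
  join_len <- pmax(x[, 2], y[, 2]) - pmin(x[, 1], y[, 1])
  # Cartesian product (meet): overlap of the intervals (0 if disjoint)
  meet_len <- pmax(0, pmin(x[, 2], y[, 2]) - pmax(x[, 1], y[, 1]))
  len_x <- x[, 2] - x[, 1]
  len_y <- y[, 2] - y[, 1]
  phi <- join_len - meet_len + gamma * (2 * join_len - len_x - len_y)
  psi <- phi / domain_len          # normalization by the domain length
  sum(psi^q)^(1 / q)
}

# Two hypothetical objects described by two interval variables:
A_i <- rbind(c(160, 220), c(20, 35))
A_k <- rbind(c(150, 195), c(15, 25))
ichino_yaguchi(A_i, A_k, domain_len = c(100, 65))
#> roughly 0.63 for these made-up intervals
```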

Ensemble learning, in general, means aggregating the results of many different models into one model that achieves better results. This idea has been successfully applied in both supervised and unsupervised approaches, for classical as well as symbolic data, and it can also be used to detect outliers (see, e.g., Aggarwal and Sathe 2017). For all distance measure computations, the R package symbolicDA will be applied (see Walesiak et al. 2018).

In this paper, models built with different symbolic distance measures and different minPts and ε values are combined to obtain one ensemble model that allows outliers to be detected in a more precise way. The results of the ensemble model will be compared with those of the single models.
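The aggregation rule is not spelled out above, so the sketch below assumes a simple majority vote over the noise labels returned by the single DBSCAN runs; dist_iy, dist_dc1, and dist_dc2 stand for precomputed dist objects holding the three symbolic distances, and the eps/minPts values are placeholders.

```r
library(dbscan)

# Single models: DBSCAN on precomputed symbolic distance matrices.
settings <- list(
  list(d = dist_iy,  eps = 0.15, minPts = 5),  # illustrative parameters
  list(d = dist_dc1, eps = 0.10, minPts = 5),
  list(d = dist_dc2, eps = 0.12, minPts = 7)
)

# dbscan() labels noise points with cluster id 0.
noise_votes <- sapply(settings, function(s) {
  dbscan(s$d, eps = s$eps, minPts = s$minPts)$cluster == 0
})

# Ensemble: flag an object as an outlier when most single models agree.
ensemble_outlier <- rowMeans(noise_votes) > 0.5
which(ensemble_outlier)
```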

3 Simulations and Their Results

To check whether DBSCAN for symbolic data can be a suitable tool for outlier detection in real data sets, unbalanced symbolic data sets were prepared for the experiments with the cluster.Gen function from the clusterSim package (Walesiak and Dudek 2020); a generation sketch for data set I is given after the list:

  1. Data set I contains 100 symbolic objects in three elongated clusters of equal size, plus 10 outliers, in two dimensions. The observations in each cluster are independently drawn from a bivariate normal distribution with means \(\left(0, 0\right)\), \(\left(1.5, 7\right)\), \(\left(3, 14\right)\) and covariance matrix \(\Sigma\) with \({\sigma }_{jj}=1\), \({\sigma }_{jl}=-0.9\).

  2. Data set II contains 190 symbolic objects, plus 20 outliers, in four clusters of sizes \(\left(70, 40, 30, 30\right)\) that are described by three variables. The observations are drawn from multivariate normal distributions with means \(\left(-4, 5, -4\right)\), \(\left(4, 14, 5\right)\), \(\left(14, 5, 14\right)\), \(\left(5, -4, 5\right)\) and identity covariance matrix \(\Sigma\), where \({\sigma }_{jj}=1\) \(\left(1\le j\le 3\right)\) and \({\sigma }_{jl}=0\) \(\left(1\le j\ne l\le 3\right)\).

  3. Data set III contains 180 observations in five not-well-separated clusters of sizes \(\left(20, 30, 40, 50, 50\right)\), plus 20 outliers, in two dimensions. The observations are independently drawn from a bivariate normal distribution with means \(\left(5, 5\right)\), \(\left(-3, 3\right)\), \(\left(3, -3\right)\), \(\left(0, 0\right)\), \(\left(-5, -5\right)\) and covariance matrix \(\Sigma\) with \({\sigma }_{jj}=1\), \({\sigma }_{jl}=0.9\).
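For illustration, data set I could be generated roughly as follows. In clusterSim, model 3 corresponds to three elongated clusters in two dimensions with the means and covariances listed above, but the exact argument values used here (objects per cluster, the code for symbolic data) are assumptions to be checked against the package manual.

```r
library(clusterSim)

set.seed(123)
gen <- cluster.Gen(numObjects  = 33,   # objects per cluster (~100 in total)
                   model       = 3,    # three elongated clusters in 2D
                   dataType    = "s",  # symbolic (interval-valued) data
                   numOutliers = 10)   # add 10 outliers

table(gen$clusters)  # cluster memberships of the generated objects
```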

For the DBSCAN algorithm, the following initial parameters have been assumed:

  • As the data sets contain two or three variables, minPts was set to 3, 5, and 7.

  • The \(\varepsilon\) value was selected by using a k-distance graph (see Sander et al. 1998), plotting the distance to the k-th nearest neighbor with k = minPts.

Table 2 presents the results for the minPts values for all distances that were taken into consideration in this research.

Table 2 Results of simulations—minPts parameter

The larger the minPts parameter in the model, the greater the number of detected outliers and the better the clustering quality. Similar results were reached for classical data by Nowak-Brzezińska and Xięski (2017, p. 65).

In the case of the minPts parameter, the normalized de Carvalho distances usually perform better, in terms of clustering quality, than the normalized Ichino-Yaguchi distance.

Table 3 presents the results for the ε parameter for all distances.

Table 3 Results of simulations—ε parameter

If the parameter ε is small, a larger number of outliers is detected and, in general, better clustering quality is achieved. The choice of the distance measure is quite important; usually the normalized de Carvalho distances perform better than the normalized Ichino-Yaguchi distance.

Table 4 presents the results for aggregated model.

Table 4 Results of simulations—distances and all models

When considering an ensemble model for each data set and the minPts and ε values, we can see that the aggregated models detect more of the existing outliers, and their clustering quality is also better.

4 Final Remarks

The DBSCAN algorithm can easily be applied to the symbolic data case. The only thing that distinguishes it from the classical version of DBSCAN is the distance measure for symbolic data.

Looking at DBSCAN's parameters, minPts (the minimal number of data points needed to form a cluster) and ε (the maximum distance between objects in a cluster), both have a significant impact on the clustering results, in terms of both clustering quality and the number of detected outliers. Higher minPts values lead to a larger number of detected outliers and also to higher clustering quality (measured by the Silhouette index), whereas higher ε values lead to a lower number of clusters, fewer outliers, and usually worse clustering quality. The selection of the initial parameters can therefore lead to quite different clustering results. Similar results were obtained by Nowak-Brzezińska and Xięski (2014) for classical data sets with outliers.

The ensemble approach, which uses information from different models (three different distance measures, three different minPts values, and three different ε values), allows some time to be saved on initial parameter tuning and, most importantly, leads to better clustering results in terms of clustering quality and the number of detected outliers in unbalanced data sets.